(57) Abstract

Methods of screening for a tumor or tumor progression to the metastatic state are disclosed. The screening methods are based on
the characterization of DNA by principal components analysis of spectral data yielded by Fourier transform-infrared spectroscopy of
DNA samples. The methods are applicable to a wide variety of DNA samples and cancer types. A model developed using multivariate
normal distribution equations and discriminant analysis is particularly well suited for distinguishing primary cancerous tissue from metastatic
cancerous tissue.
METHODS OF DIFFERENTIATING METASTATIC
AND NON-METASTATIC TUMORS 

TECHNICAL FIELD 

The present invention is generally directed toward tumor identification, 
including tumor detection and characterization. The invention is more particularly
related to characterizing DNA based upon principal components analysis of spectral 
data yielded by Fourier transform-infrared spectroscopy of DNA samples, in order to 
screen for a tumor or progression of a tumor to the metastatic state. 

BACKGROUND OF THE INVENTION 
Despite enormous expenditures of both financial and human resources
over the last twenty-five plus years, the detection of new tumors or the recurrence of
tumors remains an unfulfilled goal of humankind. Particularly frustrating is the fact
that a number of cancers are treatable if detected at an early stage, but go undetected in
many patients for lack of a reliable screening procedure. In addition, the need is acute
for reliable screening procedures that discriminate non-metastatic primary tumors (or
non-cancerous disease states) from metastatic tumors, or are predictive of progression to
the metastatic state. Metastasis of tumors is a major cause of treatment failure in cancer
patients. It is a complex process involving the detachment of cells from the primary
neoplasm, their entrance into the circulation, and the eventual colonization of local and
distant tissue sites.

Frequently, physicians must err on the side of caution, and request that a
patient undergo surgical or other procedures that dramatically affect the patient's
quality of life, without identification of the disease state as a tumor with a propensity to
progress to the metastatic state. For illustrative purposes, two particular cancers,
prostate and breast cancers, are described in more detail and are representative of
cancers in need of new approaches, which the invention disclosed herein provides.

Prostate cancer is a leading cause of death in men. Thus, there is a keen
interest in the etiology of this disease, as well as in the development of techniques for
predicting its occurrence at early stages of oncogenesis. Little is known about the
etiology of prostate cancer, the most prevalent form being adenocarcinoma. However,
several studies have focused on inactivation of the tumor suppressor gene TP53 and
altered DNA methylation patterns as possible factors. In addition, free radicals, arising
from redox cycling of hormones, have recently been implicated in prostate cancer. This
is consistent with evidence showing that the hydroxyl radical (•OH) produces
mutagenic alterations in DNA, such as 8-hydroxyguanine (8-OH-Gua) and
8-hydroxyadenine (8-OH-Ade), that have been linked to carcinogenesis in a variety of
studies. Despite these findings, virtually no understanding exists of the possible
relationship between the •OH-modification of DNA and prostate cancer.

Prostate tissue may contain areas of benign prostatic hyperplasia (BPH),
which is not regarded as a pre-malignant lesion, although it often accompanies prostate
cancer. The etiology of BPH is unknown, as is its relationship to prostate cancer. Due
to the difficulties in the current approaches to the diagnosis of prostate cancer, there is a
need in the art for improved methods. The present invention fulfills this need, and
further provides other related advantages.

Breast cancer is a leading cause of death in women and is the most
common malignancy in women. The incidence of breast cancer is on the
rise. One in nine women will be diagnosed with the disease. Standard approaches to
treat breast cancer have centered around a combination of surgery, radiation and
chemotherapy. In certain malignancies, these approaches have been successful and
have effected a cure. However, when diagnosis is beyond a certain stage, breast cancer
is most often incurable. Invasive ductal carcinoma is a common form of breast cancer
which can metastasize. Alternative approaches to early detection are needed. Due to
the difficulties in the current approaches to the diagnosis of breast cancer, there is a
need in the art for improved methods. The present invention fulfills this need, and
further provides other related advantages.

DNA is continually being modified by microenvironmental factors, thus
creating vast numbers of modified structures (ref. 1, 2). For example, the progression of
primary breast cancer to the metastatic state was estimated to involve as many as
several billion new DNA forms, many of which likely result from hydroxyl radical
(•OH)-induced structural alterations (ref. 2). Progress has been made in analyzing low
mass oligonucleotides (< 1 x 10^ base pairs) (ref. 3). However, the complexity and high
masses of the cellular DNAs (≈6 x 10^ base pairs) have hindered their structural
elucidation. Consequently, an understanding of these DNAs had to be obtained
primarily by using destructive techniques (chemical or enzymatic) that provide little
information on intact structures potentially having important biological properties.

The development of an infrared microscope spectrometer (Fig. 14),
coupled with advanced computer software, made it possible to obtain Fourier transform-infrared
(FT-IR) spectra from micrograms of cellular DNA (e.g., from biopsy
specimens).

SUMMARY OF THE INVENTION 

Briefly stated, the present invention provides methods for defining the
state of tissue, and assessing the genotoxicity of an environment. The inventive
methods are particularly well suited for differentiating a T-1 (primary, non-metastatic)
tumor from a metastatic tumor. The invention is applicable to a wide variety of DNA
samples and cancers, and to a wide variety of genotoxic environments.

In one aspect, the present invention employs the so-called "centroid"
model (which may also be called the "sigmoid curve model") with which tissue samples
are analyzed. According to the centroid model, there is provided a method of screening
for a tumor or tumor progression to the metastatic state comprising the steps of:
(a) subjecting a DNA sample to Fourier transform-infrared (FT-IR) spectroscopy to
produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by
principal components analysis (PCA); and (c) comparing the PCA of step (b) to the
PCA of FT-IR spectra for DNA samples from non-cancerous, non-metastatic tumor or
metastatic tumor samples.
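
As a rough illustration of steps (a) through (c), the following Python sketch computes PC scores for a set of reference spectra by singular value decomposition and assigns a new spectrum to the reference group whose centroid is nearest in PC space. The function names, the synthetic spectra and the nearest-centroid comparison are assumptions made for illustration only; the text above states only that the PCA of a sample is compared to the PCA of reference samples.

```python
import numpy as np

def fit_pca(spectra, n_components=3):
    """Return (mean spectrum, principal axes) for the rows of `spectra`, via SVD."""
    mean = spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:n_components]

def pc_scores(spectra, mean, components):
    """Project mean-centred spectra onto the principal axes."""
    return (spectra - mean) @ components.T

def nearest_centroid(score, centroids):
    """Return the label of the centroid closest (Euclidean) to `score`."""
    return min(centroids, key=lambda k: np.linalg.norm(score - centroids[k]))

# Synthetic stand-in data (hypothetical): 50 "wavenumbers", two reference groups.
rng = np.random.default_rng(0)
n_wavenumbers = 50
base = np.linspace(0.5, 1.5, n_wavenumbers)
shift = np.zeros(n_wavenumbers)
shift[10:20] = 0.5  # band that differs between the two groups
normal = base + 0.01 * rng.standard_normal((10, n_wavenumbers))
tumor = base + shift + 0.01 * rng.standard_normal((10, n_wavenumbers))
reference = np.vstack([normal, tumor])
labels = np.array(["normal"] * 10 + ["tumor"] * 10)

mean, comps = fit_pca(reference, n_components=2)
scores = pc_scores(reference, mean, comps)
centroids = {lab: scores[labels == lab].mean(axis=0) for lab in ("normal", "tumor")}

# A new spectrum is projected with the SAME mean and axes before comparison.
unknown = base + shift + 0.01 * rng.standard_normal(n_wavenumbers)
unknown_score = pc_scores(unknown[None, :], mean, comps)[0]
predicted = nearest_centroid(unknown_score, centroids)
```

Projecting the unknown sample with the mean and axes fitted on the reference set keeps all points in one coordinate system, which is what makes the comparison of step (c) meaningful.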

In another aspect, the present invention provides a so-called "ellipsoid
model" for characterizing the state of a tissue. In this aspect, the invention provides a
mathematical description corresponding to various defined states of a tissue of interest,
i.e., a model. Defined states of a tissue include, e.g., normal prostate tissue, benign
prostatic hyperplasia and metastatic prostate cancer, where "normal", "benign
hyperplasia" and "metastatic" are three "defined states", and prostate tissue is the
"tissue of interest".

In brief, according to the ellipsoid model, the invention provides a
method for defining the state, e.g., the physiological state, of a tissue, comprising the
steps of:

(a) subjecting DNA from a first plurality of tissue samples to Fourier
transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

(b) analyzing the FT-IR spectral data of step (a) by principal
components analysis (PCA) to provide principal component (PC) scores;

(c) applying cluster analysis to the PC scores of step (b) to
distinguish outlier and non-outlier tissue samples; and

(d) generating an equation, called a first equation, that defines a
multivariate version of a normal bell-shaped curve which best fits the PC values from
the non-outlier tissue samples, where the first equation defines the state of the first
plurality of tissue samples.
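
Steps (c) and (d) can be sketched in Python as follows. The outlier rule used here (distance from the coordinate-wise median, thresholded with the median absolute deviation) is a simple stand-in chosen for brevity and is not the cluster analysis of the text; the function names, the threshold `k` and the synthetic scores are likewise hypothetical. The "equation" of step (d) is represented by the sample mean vector and covariance matrix, which together determine a multivariate normal distribution.

```python
import numpy as np

def split_outliers(scores, k=5.0):
    """Flag points far from the coordinate-wise median.

    A simple stand-in for cluster analysis: a point is an outlier when its
    distance to the median exceeds median(dist) + k * MAD(dist).
    """
    center = np.median(scores, axis=0)
    dist = np.linalg.norm(scores - center, axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return dist > med + k * mad  # True marks an outlier

def fit_mvn(scores):
    """Return the sample mean vector and sample covariance matrix."""
    mu = scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    return mu, cov

# Synthetic PC scores (hypothetical): 40 core samples plus one distant sample.
rng = np.random.default_rng(1)
core = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([1.0, 0.5, 0.2]), size=40)
scores = np.vstack([core, [[25.0, 25.0, 25.0]]])

is_outlier = split_outliers(scores)
mu, cov = fit_mvn(scores[~is_outlier])  # fit only the non-outlier samples
```

Excluding outliers before the fit matters because a single distant point can inflate the covariance estimate and so widen the fitted ellipsoid.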

In another embodiment, the method further includes repeating steps (a)
through (d) above with a second plurality of tissue samples, to provide a second
equation, where the second equation defines the state of the second plurality of tissue
samples. In another embodiment, the method further includes the step of applying
multivariate discrimination analysis to the first and second equations, to provide first
and second probability equations, respectively. In another embodiment, the method
further includes the steps of: (e) subjecting a DNA sample from a tissue having a state
of interest to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR
spectral data of step (e) by PCA to provide a set of PC scores; and (g) combining the
PC scores of step (f) with each of the first and second probability equations to provide
first and second probability scores, respectively.
In a preferred embodiment, the inventive method provides a means for
defining (characterizing) DNA from tissues, and hence defining the tissue itself, where
the method includes the steps of:

(a) subjecting a plurality ("m") of DNA samples from a first of "n"
defined states of a tissue of interest (e.g., samples of normal prostate tissue from "m"
different individuals) each to Fourier transform-infrared (FT-IR) spectroscopy to
produce FT-IR spectral data;

(b) independently analyzing the FT-IR spectral data from each
sample of step (a) by principal components analysis (PCA) to provide a plurality ("o")
of principal component (PC) scores (i.e., PC1, PC2, PC3 ... PCo scores) from each of
the "m" FT-IR spectra, every sample being characterized by an identical number of PC
scores as obtained by the identical treatment of the FT-IR spectral data, to provide "m"
sets of PC scores, each set containing "o" values;

(c) applying cluster analysis to the set of PC scores from the "n"
defined states of the tissue of interest (i.e., to all of the PC1 to PCo scores obtained from
the FT-IR spectra of the "m" samples of DNA) as obtained from all of the samples, to
identify outlier and non-outlier tissue samples;

(d) generating an equation defining a multivariate version of a
normal bell-shaped curve which best fits the non-outlier PC1 ... PCo values for all of the
samples in the first defined state;

(e) repeating steps (c) and (d) for each of the sets of PC scores
obtained from step (b), to define a set of "n" equations, each of the "n" equations
defining a multivariate version of a normal bell-shaped curve corresponding to each of
the "n" sets of PC scores; and

(f) applying multivariate discriminant analysis to the "n" equations
defining multivariate versions of normal bell-shaped curves of step (e), to define a
probability equation for each of the "n" defined states of the tissue of interest.
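
For reference, the "multivariate version of a normal bell-shaped curve" of steps (d) and (e) and a probability equation of the kind produced in step (f) can be written in standard notation as below. The symbols are introduced here for illustration: x is the vector of "o" PC scores, and μ_k and Σ_k are the mean vector and covariance matrix fitted to the non-outlier samples of state k. The Bayes-rule form with prior weights π_k is one common formulation of discriminant analysis and is an assumption, not a formula taken from the text.

```latex
f_k(\mathbf{x}) =
  \frac{1}{(2\pi)^{o/2}\,\lvert\Sigma_k\rvert^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,
    (\mathbf{x}-\boldsymbol{\mu}_k)^{\top}\,\Sigma_k^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_k)\right),
  \qquad k = 1,\dots,n

P(\text{state } k \mid \mathbf{x}) =
  \frac{\pi_k\, f_k(\mathbf{x})}{\sum_{j=1}^{n} \pi_j\, f_j(\mathbf{x})}
```

The surfaces of constant f_k are ellipsoids centred on μ_k, which is presumably the origin of the name "ellipsoid model".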

According to the procedure outlined above (steps (a) through (f)), a
probability equation is generated corresponding to each defined state of interest for a



WO 99/00660



PCT/US98/13386 



particular tissue of interest, where in combination these "n" probability equations define
a model.

A sample of tissue of interest having an unknown defined state is then
analyzed by FT-IR, and the spectral data obtained thereby is subjected to principal
components analysis to define "o" PC scores. These "o" PC scores are then "plugged
into" each of the "n" probability equations corresponding to the various defined states
within the model for the same tissue of interest, to provide a number ("n") of probability
scores corresponding to the number of defined states from which the model was
constructed. A probability score is thus obtained for each of the defined states of the
model. A higher probability score indicates a higher likelihood that the tissue of
interest is properly characterized by the defined state corresponding to the probability
equation. For example, if plugging the PC scores into the probability equation
corresponding to normal tissue provides a probability score of "w", and if plugging
those same PC scores into the probability equation corresponding to metastatic cancer
provides a probability score of "x", and "x" < "w", then the sample is more likely to be
normal tissue than metastatic cancer.

Thus, the invention further provides a method comprising the steps of:

(1) performing steps (a) through (f) above, to provide a model
comprising a number "n" of probability equations corresponding to a number "n" of
defined states for a particular tissue of interest;

(2) performing steps (g) through (j), as follows:

(g) subjecting a DNA sample from a tissue of interest having an
unknown defined state to Fourier transform-infrared (FT-IR) spectroscopy to produce
FT-IR spectral data;

(h) analyzing the FT-IR spectral data of step (g) by principal
components analysis (PCA) to provide a plurality ("o") of principal component (PC)
scores (i.e., PC1, PC2, PC3 ... PCo scores), to provide a set of "o" PC scores;

(i) "plugging in" the set of "o" PC scores of step (h) into each of the
"n" probability equations which compose the model of step (f) to obtain a probability
score corresponding to each of the "n" defined states; and

(j) comparing the "n" probability scores from step (i) to one another
in order to determine the most likely defined state of which the tissue having an
unknown defined state is a member.
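
Steps (i) and (j) can be sketched in Python as follows, assuming each defined state is summarized by a mean vector and covariance matrix of a multivariate normal distribution and that all states are weighted equally. The state names, the parameter values and the equal weighting are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def log_density(x, mu, cov):
    """Log of the multivariate normal density at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)  # squared Mahalanobis distance
    return -0.5 * (len(mu) * np.log(2.0 * np.pi) + logdet + maha)

def probability_scores(x, model):
    """Return one probability score per defined state (equal priors assumed).

    `model` maps a state name to its (mean vector, covariance matrix).
    """
    logs = {s: log_density(x, mu, cov) for s, (mu, cov) in model.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {s: np.exp(v - top) for s, v in logs.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical two-PC model with n = 3 defined states.
model = {
    "normal": (np.array([0.0, 0.0]), np.eye(2)),
    "primary": (np.array([3.0, 0.0]), np.eye(2)),
    "metastatic": (np.array([3.0, 3.0]), np.array([[2.0, 0.0], [0.0, 2.0]])),
}

unknown = np.array([2.8, 0.3])                   # PC scores of the unknown sample
scores = probability_scores(unknown, model)      # step (i)
most_likely = max(scores, key=scores.get)        # step (j)
```

Working with log-densities and subtracting the largest value before exponentiating avoids underflow when a sample lies far from every state.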

In any of the above methods, the tissue may be breast, urogenital, liver, 
renal, pancreatic, lung, blood, brain or colorectal tissue. In one embodiment, the tissue
is cancerous, for example, cancerous breast, prostate, ovarian or endometrial tissue. 

In another embodiment, the invention provides a method for assessing
the genotoxicity of an environment. The method includes the steps of:

(a) subjecting DNA from a plurality of first organisms in a first
environment to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR
spectral data;

(b) analyzing the FT-IR spectral data of step (a) by principal
components analysis (PCA) to provide principal component (PC) scores;

(c) applying cluster analysis to the PC scores of step (b) to
distinguish outlier and non-outlier organisms; and

(d) generating an equation, called a first equation, that defines a
multivariate version of a normal bell-shaped curve which best fits the PC values from
the non-outlier organisms, where the first equation defines the first organisms in the
first environment.

In one embodiment, the invention further includes repeating steps (a)
through (d) above with DNA samples from second organisms taken from a second
environment, to provide a second equation, where the second equation defines the state
of the second organisms in the second environment. In another embodiment, the
invention further includes applying multivariate discrimination analysis to the first and
second equations, to provide first and second probability equations, respectively. In
another embodiment, the invention provides a method that further includes the steps of:
(e) subjecting a DNA sample of an organism of interest from an environment of interest
to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR spectral
data of step (e) by PCA to provide a set of PC scores; and (g) combining the PC scores
of step (f) with each of the first and second probability equations to provide first and
second probability scores, respectively.

In optional embodiments, at least one of the first and second
environments is a polluted environment. In another optional embodiment, the first and
second organisms are non-identical; however, the first and second environments are
identical. In another optional embodiment, the first and second organisms are identical;
however, the first and second environments are non-identical.

Thus, in a preferred embodiment, the present invention provides a
method for assessing the genotoxicity of an environment. The method is essentially as
described above and uses the centroid or ellipsoid model; however, the DNA samples
are from organisms taken from various environments. As one example, the
environments may suffer from various degrees of pollution. In any event, according to
the centroid model, the method comprises the steps of: (a) subjecting a DNA sample of
a first organism in an environment to Fourier transform-infrared (FT-IR) spectroscopy
to produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by
principal components analysis (PCA); and (c) comparing the PCA of step (b) to the
PCA of FT-IR spectra for DNA samples of: (1) the first organism prior to introduction
in the environment of step (a), or (2) a second organism in a nonpolluted environment.
The ellipsoid model may likewise be used in a method for assessing the genotoxicity of
an environment.

These and other aspects of the present invention will become evident 
upon reference to the following detailed description and attached drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 shows a two-dimensional PC plot derived by PCA/FT-IR 
spectral analysis showing distinct clustering of normal, benign prostatic hyperplasia
("BPH") and prostate cancer points. Notably, both of the groups of prostate lesions 
occur to the right of the points for the DNA of normal prostate. 

Figure 2 shows a comparison of the mean spectrum of prostate cancer vs. 
normal tissue (Figure 2A), BPH vs. normal tissue (Figure 2B) and prostate cancer vs. 
BPH (Figure 2C). The lower plot of each panel (A-C) shows the statistical significance
of the difference in mean absorbance at each wavenumber, based on the unequal
variance t-test. P-values are plotted on the log10 scale.

Figure 3 shows sigmoid curves depicting the probability of DNA being
classified as normal tissue versus prostate cancer (Figure 3A), normal tissue versus
BPH (Figure 3B) and BPH versus prostate cancer (Figure 3C). The curves are based
on the logistic regression models depicted in Table 2 below. The predicted probabilities
rise very rapidly over a narrow range, which reflects a high degree of discrimination
among groups and a precipitous change in DNA structure associated with the normal to
BPH and normal to prostate cancer progressions. Each sample is plotted at its predicted
probability.

Figure 4 is a three-dimensional plot of PC 1, 2 and 3 wherein each
sphere represents a DNA absorbance spectrum and the location of a sphere is
determined by the "shape" of the spectrum, including height, width and location of
absorbance peaks. The core cluster of non-invasive ductal carcinoma of the breast
("IDC") spheres in the upper part of the plot (medium stipple) is significantly smaller
than the more diverse and larger IDCm cluster (heavy stipple), and the reduction
mammoplasty tissue ("RMT") and metastatic invasive ductal carcinoma ("IDCm")
clusters substantially overlap and are not statistically different in size;

Figure 5A shows two spatially close IDC spectra (see arrows indicating
A and B on the three-dimensional PCA plot) wherein the two overlaid spectra shown in
Figure 5B differ by a mean of only 3% in normalized absorbance, demonstrating the
high specificity of the PCA and the fact that spatially close spheres have almost
identical spectral profiles;

Figures 6A and 6B show the spectral profiles of two IDC outliers
(identified in Figure 5) compared to the spectral profile of the mean IDC core cluster;
"1" represents a multifocal carcinoma, with one focus being a highly malignant signet
ring cell carcinoma, and "2" represents a bilateral breast cancer. In each case, the
dramatic difference between the mean and outlier spectrum is apparent over most of the
spectral region (see text for wavenumber - structural relationships) illustrating the
pronounced structural specificity associated with the PC analysis;

Figure 7 shows a centroid calculation of the spectra for the RMT, IDC,
and IDCm specimens on a graph plotting PC2 vs. PC1, and the direction vectors from
the RMT centroid to the IDC centroid, and the IDC centroid to the IDCm centroid;

Figure 8 shows a centroid spectra overlay for the average RMT, IDC, 
and IDCm species;

Figure 9 shows a centroid spectra overlay for the average RMT, IDC, 
and IDCm species after subtracting the mean, thus emphasizing the spectral differences
between the species;

Figure 10 shows the predicted probabilities of cancer based on FT-IR
methodology;

Figure 11 shows a three-dimensional projection of the clusters of points
derived from the first three PC scores, which summarize spectral features of the DNA
from English sole inhabiting an essentially clean control environment (QMH group) or
inhabiting a chemically contaminated urban environment (DUW group);

Figures 12A-12C show a comparison of the mean spectrum for each of a 
QMH group and a DUW group. The lower plot of each panel shows the statistical 
significance of the difference in mean absorbance at each wavenumber, based on the
unequal variance t-test. P-values are plotted on the log10 scale;

Figure 13 shows overlays of the individual spectra of QMH and DUW
groups;

Figure 14 provides a picture and schematic diagram of an FT-IR
microscope spectrometer. Figure 14A shows two overlaid grand mean spectra, while
Figure 14B provides P-values obtained for each wavenumber using the unequal
variance t-test.

Figure 15A shows a three-dimensional PC plot of a breast cancer (IDC) 
cluster including two specimens with very similar PC scores designated "a" and "b".
There are also two outliers: "c" represents the DNA of an IDC tissue from a patient with
bilateral breast cancer and "d" the DNA of a multifocal carcinoma, one focus being a
highly malignant signet ring cell carcinoma;

Figure 15B shows that the spectra "a" and "b" differ by only 3% of mean
normalized absorbance. Although the two spectra are virtually identical, their
corresponding PC points are spatially distinct, thus demonstrating the high spectral
specificity achieved with PCA;

Figure 15C provides the spectrum of outlier "c" (from Figure 15A)
compared with the mean spectrum of the IDC core cluster (without the outliers);

Figure 15D shows the spectrum of outlier "d" (from Figure 15A)
compared with the mean spectrum of the IDC core cluster (without the outliers). The
dramatic differences between the mean and outlier spectra are apparent over most of the
spectral region, resulting in the two corresponding PC points being far away from the 
main cluster. 

Figure 16A is a three-dimensional plot of PC scores of DNA from
normal breast (n = 21) and breast cancer (IDC; n = 37) tissues showing distinct
clustering of each group, together with the two outliers (c and d) shown in Fig. 15A;

Figure 16B is a plot of the probability of cancer versus the risk score for
the normal breast and breast cancer. The cancer samples are mainly located at the upper
portion of the sigmoid curve where the probability of cancer is > 61.5%, whereas the
normal breast samples are situated primarily in the lower portion. The null hypothesis
that the PC scores do not discriminate between the groups is rejected with P < 0.0001;

Figure 16C is a two-dimensional plot of PC scores of DNAs from
normal prostate (n = 5), BPH (n = 18) and prostate cancer (adenocarcinoma; n = 8) in
which the clustering is distinct (4);

Figure 16D is a plot of the probability of cancer vs. the risk score for
normal prostate and prostate cancer. The null hypothesis that the PC scores do not
discriminate between the groups is rejected, with P = 0.04. The cancer outlier on the
right side of the plot in Figure 16C is in the same direction as the progressions from
normal to cancer in the probability curve. This suggests that the DNA represented by
this outlier has a high degree of structural modification.




Figure 17 is a three-dimensional representation of DNA spectrum for
IDC and IDCm (in analogy with Figure 16A, which provides a similar three-dimensional
representation for normal breast tissue and breast cancer).

Figure 18 is a plot obtained from a two-component ellipsoid model for
discriminating metastatic breast cancer (IDCm) and reduction mammoplasty tissue
(RMT);

Figure 19 is a plot obtained from a two-component ellipsoid model for
discriminating primary breast cancer (IDC) and metastatic breast cancer (IDCm); 

Figure 20 is a plot obtained from a three-component ellipsoid model for 
discriminating IDC, IDCm and RMT tissues;

Figure 21 is a plot obtained from a three-component ellipsoid model for 
discriminating between normal (RMT), primary (IDC) and metastatic (IDCm) breast
cancer; 

Figure 22 shows plots of 100 simulated normal, IDC and IDCm cases
based on the multivariate normal model (i.e., the ellipsoid model).

DETAILED DESCRIPTION OF THE INVENTION

As noted above, the present invention is directed, in one aspect, toward
methods of screening for a tumor or tumor progression to the metastatic state. The
methods are based on the analysis of DNA. Because DNA is ubiquitous in all
organisms, the methods of the invention are not limited to use of a particular DNA
sample. Thus, a wide variety of cancers may be screened. Representative examples of
cancers include breast, urogenital, melanoma, liver, renal, pancreatic, lung, circulation
system, nervous system or colorectal cancers. Urogenital cancers include prostate,
cervical, ovarian, bladder or endometrial cancers. Circulation system cancers include
lymphomas. Nervous system cancers include brain cancers.

As used herein, the term "screening for" includes detecting, monitoring,
diagnosing or prognosticating (predicting). DNA is analyzed as described herein to
screen for a tumor. As used herein, "a tumor" may be present for the first time, or
reoccurring, or in the process of occurring or reoccurring. The last scenarios (i.e.,
process of) represent opportunities for assessing, and insight into, the risk of cancer
prior to clinical manifestation. The present invention may be used to predict that cancer
cells are likely to form, even though they have yet to appear based on currently
available methodologies. DNA is also analyzed as described herein to screen for tumor
progression to the metastatic state. Progression of the tumor to the metastatic state
refers to the end point (i.e., the metastatic state) as well as any intermediate point on the
way to the end point.

The term "screening" further includes differentiating a metastatic and
non-metastatic tumor. The so-called ellipsoid model, as described herein, is particularly
preferred for this aspect of screening. In fact, using the ellipsoid model, normal tissue
was correctly identified 89% of the time (16 of 18 samples) while cancer tissue was 
correctly identified 97% of the time (31 of 32 samples). In addition, using the ellipsoid 
model, primary (IDC) cancer was correctly identified 100% of the time (10 of 10 
samples) while metastatic (IDCm) cancer was correctly identified 82% of the time. 

A "DNA sample" is DNA in, or from, any source. DNA may be
removed from a variety of sources, including a tissue source or a fluid source. Tissue
sources include tissue from an organ or membrane or skin. Fluid sources include whole
blood, serum, plasma, urine, synovial, saliva, sputum, cerebrospinal fluid, or fractions
thereof. With respect to a tissue sample, for example, tissue may be removed from an
organism by biopsy (such as a fine needle biopsy) and the DNA extracted, all by
techniques well known to those in the art. Similarly, DNA may be extracted from a
fluid source using known techniques. Although extraction/isolation of DNA may be
preferred, DNA need not be extracted/isolated in order to carry out the invention. It is
possible to examine DNA directly using Fourier transform-infrared (FT-IR)
spectroscopy. For example, by specifically limiting the IR scan to cellular nuclei,
spectral profiles of high concentration may be generated. Therefore, a DNA sample
may be extracted/isolated DNA or a sample may include DNA.

It is possible to store tissue for later analysis of the DNA. For example,
excised tissue may be frozen immediately in liquid nitrogen and maintained at -80°C.
Following isolation of the DNA from such tissue, it is normally dissolved in deionized
water and aliquoted into portions for FT-IR spectroscopy. Aliquots are typically dried
completely by lyophilization, purged with pure nitrogen and stored in an evacuated,
sealed glass vial.

Within the present invention a DNA sample is subjected to FT-IR
spectroscopy and the FT-IR spectral data analyzed by principal components analysis.
The starting point for the characterization of DNA in a sample is a set of IR spectra.
Each spectrum shows numerical absorbances at each integer wavenumber, i.e.,
generally from 4000-700 cm⁻¹ and typically from 2000-700 cm⁻¹. Infrared (IR) spectra
of DNA samples are obtained with a Fourier Transform-IR spectrometer, for example a
Perkin-Elmer System 2000 (The Perkin-Elmer Corp., Norwalk, CT) equipped with an
IR microscope and a wide-range mercury-cadmium-telluride detector. The DNA is
generally placed on a barium fluoride plate in an atmosphere with a relative humidity of
less than ~60% and flattened to make a transparent film. Using the IR microscope in a
visual-observation mode, a uniform and transparent portion of the sample is selected to
avoid a scattering or wedge effect in obtaining transmission spectra. Each analysis is
generally performed in triplicate on 3-5 µg of DNA and the spectra are computer
averaged. Generally, two hundred fifty-six scans at a 4-cm⁻¹ resolution are performed
for each analysis to obtain spectra in a frequency range of 4000-700 cm⁻¹. Typically
3-5 minutes elapse from when the glass vial is broken to when each IR spectrum is
obtained. Typically, the DNA specimens vary in thickness, yielding a diverse set of
absorbances or spectral intensities. None of the IR spectra show a 1703-cm⁻¹ band,
which is indicative of specific base pairing. This fact indicates that the samples have
acquired a disordered form, the D-configuration.

The IR spectra are obtained in transmission units and converted to
absorbance units for data processing. For example, the Infrared Data Manager software
package (The Perkin-Elmer Corp.) may be used to control the spectrometer and to
obtain the IR spectra. Additionally, the GRAMS/2000 software package (Galactic
Industries Corp., Salem, NH) may be used to perform postrun spectrographic data
analysis. Each spectrum is converted to a spreadsheet format that includes a specific
absorbance for every wavenumber from 4000 to 700 cm⁻¹.
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In processing the IR data, a baseline adjustment is generally used for all 
spectra to remove the effect of background absorbance. In order to do this, the mean 
absorbance across 11 wavenumbers, centered at the lowest point (e^g., for the range 
2000-700 cm"^) is subtracted from absorbances at all frequencies. In addition, the.IR 
5 data is generally normalized. Because there is not a well-established reference peak in 
the frequency range of 2000-700 cm"^ useftil for normalization, generally normalization 
is achieved by converting all absorbances to a constant mean intensity in the range of 
interest. For example, the region of 1750-700 cm*^ (a span of 1051 wavenumbers) has 
been typically chosen within the present invention as the primary region for analysis, 

10 because it includes widely varying absorbances. After the removal of a baseline, 
described above, absorbances at all wavenumbers in a spectrum are divided by the mean 
absorbance ranging form 1750 to 700 cm"^ for that spectrum, resulting in a mean 
spectral intensity of 1.0 for every specimen. All further analyses are generally 
performed on these baselined, normalized spectra (although analysis without the mean 

15 removed is also possible). 
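
By way of illustration only, the baseline adjustment and normalization described above
may be sketched in Python as follows. The function name, its parameters, the synthetic
spectrum, and the truncation of the 11-point window at the ends of the spectrum are
assumptions made for this sketch and are not part of the described procedure.

```python
import math
from statistics import mean


def baseline_and_normalize(wavenumbers, absorbances,
                           baseline_range=(700, 2000),
                           norm_range=(700, 1750),
                           window=11):
    """Return a baselined spectrum scaled to mean 1.0 over norm_range."""
    lo, hi = baseline_range
    # Index of the lowest absorbance inside the baseline search range.
    in_range = [i for i, w in enumerate(wavenumbers) if lo <= w <= hi]
    i_min = min(in_range, key=lambda i: absorbances[i])

    # Mean across `window` points centered at that lowest point. Truncating
    # the window at the ends of the spectrum is an assumption of this sketch.
    half = window // 2
    start = max(0, i_min - half)
    stop = min(len(absorbances), i_min + half + 1)
    baseline = mean(absorbances[start:stop])
    corrected = [a - baseline for a in absorbances]

    # Divide by the mean absorbance of the region of interest (1750-700 cm^-1).
    n_lo, n_hi = norm_range
    scale = mean([c for w, c in zip(wavenumbers, corrected) if n_lo <= w <= n_hi])
    return [c / scale for c in corrected]


# Synthetic spectrum (illustrative only): one broad band on a flat background.
wavenumbers = list(range(2000, 699, -1))
absorbances = [0.05 + math.exp(-((w - 1200) / 150.0) ** 2) for w in wavenumbers]

normalized = baseline_and_normalize(wavenumbers, absorbances)
region = [a for w, a in zip(wavenumbers, normalized) if 700 <= w <= 1750]
print(len(region), round(mean(region), 6))
```

The region of interest contains 1051 points, and its mean intensity after processing is
1.0, so spectra from films of different thickness become directly comparable.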

Within the present invention, factor analysis is used to study the
variation among spectra and the relation of this variation to subgroups, such as cancer
versus non-cancer. In particular, spectral data acquired by FT-IR spectroscopy are
analyzed using a principal components analysis (PCA) statistical approach. PCA is a
statistical procedure applied to a single set of variables with the purpose of revealing a
few variables (principal component scores or PCs) that are independent of each other
and that capture most of the information in the original long list of variables (e.g.,
Timm, N.H. in Multivariate Analysis, ed. Timm, N.H., 1975, Brooks/Cole, Monterey,
CA, pp. 528-570). PCA yields a few PCs that summarize the major features that vary
across spectra. PCA may be based on over a million correlations between
absorbance-wavenumber values over the entire infrared spectrum. Numerous variables
comprising the complex spectral relationships are reduced to a few PC scores. Each PC
score is the weighted sum of the wavenumber-by-wavenumber deviations of a spectrum
from the grand mean spectrum. Each PC score appears as a point in two- and
three-dimensional PC plots and represents a group of distinct and highly discriminating
structural properties of DNA.

For example, five principal components (i.e., five dimensions) can be
sufficient to describe 1051 dimensions of FT-IR spectra (with the grand mean of all
spectra subtracted from each spectrum) and visual representation in two or three
dimensions is adequate. PCA is available in many basic and advanced statistical
programs, such as SAS and S-Plus.
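
By way of illustration only, the statement that a PC score is a weighted sum of the
deviations of a spectrum from the grand mean spectrum may be sketched in Python as
follows. The sketch finds only the first principal component, and it uses power
iteration rather than the routines of SAS or S-Plus; the function name and the
three-wavenumber toy spectra are assumptions made for this example.

```python
import math


def first_principal_component(spectra, iterations=100):
    """Return (grand_mean, weights, scores) for the first PC of `spectra`."""
    n, d = len(spectra), len(spectra[0])
    grand_mean = [sum(s[j] for s in spectra) / n for j in range(d)]
    deviations = [[s[j] - grand_mean[j] for j in range(d)] for s in spectra]

    # Power iteration: repeatedly apply X^T X (proportional to the covariance
    # matrix) to a trial weight vector and renormalize it to unit length.
    weights = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        proj = [sum(r[j] * weights[j] for j in range(d)) for r in deviations]
        w = [sum(proj[i] * deviations[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        weights = [x / norm for x in w]

    # Each PC score is the weighted sum of a spectrum's deviations from the
    # grand mean spectrum.
    scores = [sum(r[j] * weights[j] for j in range(d)) for r in deviations]
    return grand_mean, weights, scores


# Toy "spectra" with three wavenumbers (illustrative values only): a common
# shape plus a varying amount of one pattern of change.
base = [0.2, 0.5, 0.3]
pattern = [0.1, 0.2, 0.2]
amounts = [-1.5, -0.5, 0.5, 1.5]
spectra = [[b + t * p for b, p in zip(base, pattern)] for t in amounts]

grand_mean, weights, scores = first_principal_component(spectra)
print([round(s, 3) for s in scores])
```

In this toy case a single score per spectrum recovers the amount of the common pattern
present in that spectrum, which is the sense in which a few PC scores can summarize
many wavenumbers.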

The entire analysis is generally carried out with core clusters from each
of the three groups (DNA from non-cancerous samples, non-metastatic tumor samples,
and metastatic tumor samples), although it is possible to use more or less than all three
groups (e.g., two of three groups, or non-cancerous samples versus all tumor samples
regardless of whether metastatic or not). Using cluster analysis, those members of a
specified group that stand apart from others in the core group are identified. The
isolated group members all stand apart from any others in their group at Euclidean
distances generally representing at least a 12% difference in the mean normalized
absorbance, a visibly notable difference when spectra are conventionally plotted. The
core clusters can be considered to be the more commonly encountered DNA structural
phenotypes, whereas the isolated group members ("outliers") represent less frequent
phenotypes not present in great enough numbers to study with the sample, yet overly
influential in the analysis if included.

Using core cluster analysis, PC scores are thus characterized in terms of
"outliers" and "inliers". The PC scores which are "inliers" may then be manipulated
according to either of the centroid or ellipsoid models. The centroid model is discussed 
first below, followed by a discussion of the ellipsoid model. 

The determination of whether DNA structural changes for the
progression of non-cancerous (NC) to non-metastatic tumor (NMT) are the same as for
the progression of non-metastatic tumor (NMT) to metastatic tumor (MT) is tested on
the basis of centroids statistically derived from groups of points. The centroid is the
vector of mean absorbances of the 1051 individual wavenumbers from 1750 to
700 cm⁻¹. If the two progressions are similar, then the centroids of the three groups line
up in two- and three-dimensional space.

Formally, the hypothesis that cos(θ) = 1.0 is tested, where θ is the angle
between a vector x pointing from the NC to the NMT centroid and a vector y pointing
from the NMT to the MT centroid. cos(θ) is defined by cos(θ) = x·y/(|x| · |y|); the
vector x is indexed by wavenumbers and, at each wavenumber, contains the difference
between the mean normalized absorbance of NMT spectra and the mean normalized
absorbance of NC spectra. The vector y shows the corresponding difference for MT
minus NMT spectra. An angle θ = 0 [which is equivalent to cos(θ) = 1.0] implies that
the MT is a "virtual straight ahead" continuation of the NC -> NMT progression, and
that the centroids line up, whereas θ ≠ 0 implies that the NMT -> MT progression
involves a different suite of spectral (structural) changes. The hypothesis that cos(θ) =
1.0 is tested using the bootstrap method (Efron and Gong, Am. Stat. 37:36-48, 1983),
which involves resampling with replacement from the NC, NMT, and MT core clusters
and calculation of cos(θ) for each resampling.
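
By way of illustration only, the calculation of cos(θ) and its bootstrap resampling may
be sketched in Python as follows. The function names, the number of resamplings, the
use of a percentile interval, and the two-dimensional toy clusters are assumptions made
for this sketch; actual spectra would contribute 1051 coordinates per sample.

```python
import math
import random


def centroid(spectra):
    n = len(spectra)
    return [sum(s[j] for s in spectra) / n for j in range(len(spectra[0]))]


def cos_theta(nc, nmt, mt):
    """Cosine of the angle between the NC->NMT and NMT->MT centroid vectors."""
    c_nc, c_nmt, c_mt = centroid(nc), centroid(nmt), centroid(mt)
    x = [b - a for a, b in zip(c_nc, c_nmt)]   # NC -> NMT
    y = [b - a for a, b in zip(c_nmt, c_mt)]   # NMT -> MT
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def bootstrap_cos_theta(nc, nmt, mt, n_boot=1000, seed=0):
    """Sorted cos(theta) values from resampling each cluster with replacement."""
    rng = random.Random(seed)

    def resample(group):
        return [rng.choice(group) for _ in group]

    return sorted(cos_theta(resample(nc), resample(nmt), resample(mt))
                  for _ in range(n_boot))


# Toy core clusters (illustrative values only) whose centroids form a right angle.
nc = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.0, 0.0]]
nmt = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0], [1.1, 0.0]]
mt = [[1.0, 1.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]

observed = cos_theta(nc, nmt, mt)
boot = bootstrap_cos_theta(nc, nmt, mt)
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]
print(round(observed, 3), upper < 1.0)
```

If the upper end of the bootstrap interval lies well below 1.0, as it does for these toy
clusters, the hypothesis that the three centroids line up would be rejected.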

To determine if the populations from which the NC and NMT core
clusters are drawn have distinct centroids (i.e., distinct mean absorbance spectra), a
permutation test is carried out on the distance between the NC and NMT centroids,
randomly permuting labels among NC and NMT samples and recalculating distances
between centroids. A similar permutation test is carried out for the distance between
the NMT and MT centroids. Finally, the sizes of the three core clusters are compared
using the Kruskal-Wallis ANOVA and Mann-Whitney (MW) tests on the distance of
each spectrum to the centroid of its cluster. (The P values from the Kruskal-Wallis and
MW tests are approximate, due to some statistical dependence introduced when sample
values are compared with their sample mean.)
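
By way of illustration only, the permutation test on the distance between two centroids
may be sketched in Python as follows. The function names, the number of permutations,
the add-one correction to the P value, and the toy clusters are assumptions made for
this sketch.

```python
import math
import random


def centroid(spectra):
    n = len(spectra)
    return [sum(s[j] for s in spectra) / n for j in range(len(spectra[0]))]


def centroid_distance(group_a, group_b):
    ca, cb = centroid(group_a), centroid(group_b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ca, cb)))


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """P value for the observed centroid distance under random relabelling."""
    rng = random.Random(seed)
    observed = centroid_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly permute the group labels
        if centroid_distance(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            hits += 1
    # The add-one correction counts the observed labelling as one permutation.
    return (hits + 1) / (n_perm + 1)


# Toy core clusters (illustrative values only), well separated from each other.
nc = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1], [0.05, 0.05]]
nmt = [[x + 1.0, y + 1.0] for x, y in nc]

p_value = permutation_p_value(nc, nmt)
print(p_value < 0.05)
```

A small P value indicates that a centroid distance as large as the observed one rarely
arises when group labels are assigned at random.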

Wavenumber-absorbance relationships of infrared spectra of DNA
analyzed by principal components analysis (i.e., PCA of FT-IR spectral data) may be
expressed as points in space. Each point represents a highly discriminating measure of
DNA structure. These PC scores can be plotted in 2- and 3-dimensional plots. The
position of a spectrum in a plot is a description of how it differs from or is similar to
other spectra in the plot. Different plot symbols or clusters for different groups of
spectra help to highlight clustering of spectra. In addition, when two groups of spectra
are analyzed, logistic regression can be used to develop a model for classifying the
spectra based on their PC scores. Logistic regression is a method commonly used for
classification and is available in many statistical software packages (such as SAS and
S-Plus). The PC scores are predictors and the result is an equation (a model) which can
be used to classify specimens. Each specimen is tagged with a numerical probability of
being in the cancer group (for example) versus the non-cancer group. The results of this
analysis can be plotted as a sigmoid curve with the cancer risk score (the logit of the
estimated probability) on the X-axis and the estimated probability on the Y-axis. Using
the prediction equation, the probability for a new specimen can also be calculated. By
choosing a cut point (such as a probability of 0.5 or greater) all specimens can be
classified as cancer or non-cancer (for example). The sensitivity and specificity of the
classification can also be calculated using standard methods.
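
By way of illustration only, the use of a prediction equation, a cut point, and the
calculation of sensitivity and specificity may be sketched in Python as follows. The
intercept, the coefficients, and the PC scores are hypothetical values chosen for this
sketch; in practice the coefficients would be estimated by logistic regression.

```python
import math


def cancer_probability(pc_scores, intercept, coefs):
    """Return (risk score, probability) from a logistic prediction equation."""
    risk = intercept + sum(c * s for c, s in zip(coefs, pc_scores))
    return risk, 1.0 / (1.0 + math.exp(-risk))


def sensitivity_specificity(probs, labels, cut=0.5):
    """labels: 1 = cancer, 0 = non-cancer; cancer is called when p >= cut."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cut)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cut)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cut)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cut)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model and specimens (two PC scores each), for illustration only.
intercept, coefs = 0.0, [2.0, -1.0]
specimens = [([1.0, 0.0], 1), ([0.5, 0.5], 1), ([-0.2, 0.1], 1),
             ([-1.0, 0.0], 0), ([0.0, 1.0], 0), ([0.3, 0.2], 0), ([-0.5, -0.5], 0)]

results = [cancer_probability(pcs, intercept, coefs) for pcs, _ in specimens]
probs = [p for _, p in results]
labels = [y for _, y in specimens]
sensitivity, specificity = sensitivity_specificity(probs, labels, cut=0.5)
print(round(sensitivity, 3), round(specificity, 3))
```

Plotting each probability against its risk score traces the sigmoid curve, because the
risk score is the logit of the probability.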

Combination of FT-IR spectroscopy with statistics

FT-IR spectra are sensitive representations of DNA structure (refs. 2,
4-6). Subtle changes, such as in redox status induced by free radicals (refs. 1, 5, 6), will
likely affect vibrational and rotational motion, thus altering wavenumber-absorbance
relationships. Structural differences between two groups of DNAs can be identified
using t-tests on the grand mean spectra, such as shown in Fig. 14A. The resultant
P-values are given in Fig. 14B (ref. 4). The t-tests provide a P-value for the difference
in mean absorbance at each wavenumber. In contrast, PCA is based on over a million
correlations between absorbance-wavenumber values over the entire spectrum (ref. 2).
The numerous variables comprising the complex spectral relationships are taken into
account and reduced to a few PC scores that are independent of each other. Each PC
score is a weighted sum of the wavenumber-by-wavenumber deviations of a spectrum
from the grand mean spectrum. In essence, the PC score represents a group of distinct
spectral (hence, structural) properties of DNA.

Usually, the first two or three PC scores comprise approximately 80% of the total
variance. Three- (Fig. 15A, 16A) or two- (Fig. 16C) dimensional plots can be
constructed based on these scores, each spectrum being represented by a single point
whose spatial orientation is a highly discriminating measure of DNA structure.
Virtually identical spectra (Fig. 15B) can be separated as points in a PC plot (Fig. 15A,
a and b). Moreover, two outlier points (Fig. 15A, c and d) representing spectra that are
markedly different from the mean spectrum (Fig. 15C, D) are located well away from
the main cluster.

Logistic regression or discriminant analysis estimates a specimen's
"cancer probability" between 0.0 (non-cancer) and 1.0 (cancer), based on its PC scores.
Predicted cancer probabilities, derived from a model using the PC scores, are plotted vs.
calculated risk scores (Fig. 16B, D). Probability values between those of normal and
transformed tissues represent various degrees of cancer risk (refs. 2, 4-6). The
probability-risk relationships constitute a promising basis for screening and prognostic
trials.

Applications of FT-IR/statistics technology

In studies of breast cancer (refs. 2, 5, 6), major spectral differences were
found for the progression normal breast -> breast cancer (invasive ductal carcinoma;
IDC). A three-dimensional PC plot revealed a distinct cluster of points representing the
DNA of each group (Fig. 16A). PC points for the IDC group were selected out and
presented in Fig. 15A. Point c, which represents the DNA of a patient with bilateral
breast cancer, was completely separated from the main cluster representing the DNA of
patients with single breast tumors (ref. 2). Differences in the lesion status of a tissue
were found to markedly shift the PC point position. Point d, which represents a specimen
containing a second focus of signet ring cell carcinoma, a highly malignant lesion, is
well separated from the main cluster. These examples demonstrate that the
FT-IR/statistics technology has a potentially high capability for elucidating DNA
structural changes in relation to a variety of biological conditions.

Normal and breast cancer PC scores, for a total of 54 samples, were
analyzed using logistic regression and the resulting sigmoid curve of cancer probability
vs. the risk score (Fig. 16B) showed a number of transitional values between non-cancer
and cancer. In classifying the samples (including four additional distinct outliers) the
predictive model had a sensitivity of 86% (percent of patients with cancer correctly
classified) and a specificity of 81% (percent of patients without cancer correctly
classified), using 61.5% probability as the cut-point. (The cut-point was chosen to
jointly maximize sensitivity and specificity and may vary among diseases and
populations.) The power of the model was substantiated by an independent test.
Spectra of microscopically normal tissue (MNT) from near the breast tumors of 11
women (not included in the predictive model) were analyzed and the corresponding PC
scores were calculated. When the scores were used in the model, ten of eleven (91%)
had a predicted cancer probability > 75%. Thus, on the basis of their DNA structures
the MNTs were classified as "high risk." This is supported by data showing that tissue
near a breast tumor has a high risk for developing a second lesion (ref. 6).

Comparisons of grand mean spectra for the progression primary breast
cancer -> metastatic breast cancer showed that the structure of DNA was markedly
altered (ref. 2), as suggested by pronounced differences in spectral areas assigned to the
nucleotide bases and deoxyribose. These changes, attributed primarily to an increase in
reactions of the •OH with DNA, resulted in a substantial increase in structural diversity
that was calculated on the basis of PC scores as previously described (ref. 2). The
determination of diversity provides a useful measure of structural damage to DNA, such
as induced by free radicals.

A comparison of grand mean spectra in the progressions normal prostate
-> prostate cancer (Fig. 14A) and normal prostate -> benign prostatic hyperplasia
(BPH) revealed for the first time that the transformations involve significant structural
alterations in DNA (ref. 4). The first two PC scores (76% of the total variance) were
used for a two-dimensional plot (Fig. 16C). The groups showed distinct clustering.
The prostate lesion clusters were located to the right of those of the normal prostate, and
the BPH cluster was located to the right of the cancer cluster. The spatial arrangement
suggests that the hypothetical progression BPH -> prostate cancer (ref. 7) is unlikely
because it would require a structural reversion compared to the normal -> BPH
transformation (ref. 4). This implies that each type of lesion is biologically derived
independently, or that there are additional alterations in the DNA of BPH that mimic a
reversal in the progression to cancer.

The probability of prostate cancer, obtained via discriminant analysis,
was plotted vs. the risk score (the logit of the probability) and revealed near separation
of the groups (Fig. 16D). The discriminant model (calculated using a total of 12 cancer
and non-cancer samples) represented the clusters as multivariate normal distributions.
In classifying the samples (including one additional cancer outlier) the predictive model
had a sensitivity of 88% and a specificity of 80%, using 50% probability as the
cut-point. The technology affords a promising opportunity for additional studies of
prostate cancer, to include the putative etiological relationship between prostatic
intraepithelial neoplasia (PIN) and adenocarcinoma and the association of prostate
specific antigen (PSA) test results with cancer probability values (ref. 7).

According to the ellipsoid model (which may also be referred to as the
"multivariate normal model" or "MNM"), the PC scores capture patterns in variation in
FT-IR spectra, where each PC score is a weighted sum of absorbances by wavenumber,
as stated above. Each PC score emphasizes particular spectral regions, where a set of
PC scores (about 6 scores are usually sufficient; however, fewer scores may
also be satisfactory) represents each spectrum very well. The PC scores will vary across
spectra, and will emphasize differences between spectra. Generally, 6 PC scores are
sufficient to capture at least about 90% of the total variation between the spectra.

The set of PC scores for a cluster (e.g., IDCm) can be approximated by a
statistical model. Each PC score, e.g., PC1, can be approximated by a "bell-shaped
curve", i.e., a Gaussian distribution. Thus (when there are six PC scores), each of PC1,
PC2, ..., PC6 can be approximated by a bell-shaped curve separately. When several
states are analyzed together, PC1, PC2, etc. are usually correlated within a given state
(e.g., IDCm). The full model is the multivariate normal distribution, which is a
mathematical equation.

The model may be viewed as infinitely many combinations of PC1, PC2,
... PC6, etc., but some combinations are more probable than others. It is possible to
draw a random sample from the model, and it is not necessary to have the original data
to do this (the model is sufficient). If the sample is plotted (e.g., PC2 vs. PC1), the plot
will show great density where the mathematical model indicates that spectra are more
likely to occur.

The model also allows construction of ellipsoids that capture 90% (or
any desired percentage) of the infinite possibilities from the model. Mathematically,
numerical methods are used to integrate the model function, where integrating inside
the 90% ellipsoid yields 90% of the value obtained by integrating over -∞ to +∞. The
ellipsoid will contain 90% of the probability. A randomly selected IDCm spectrum, for
example, has a 90% probability of falling inside the ellipsoid generated from IDCm data.
The length, width and height of a 3-dimensional ellipsoid are proportional to the
standard deviation of PC score 1, PC score 2, PC score 3, respectively, for that cluster
(e.g., IDCm). The actual calculations are performed using the chi-squared distribution.
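
By way of illustration only, the relation between the 90% region and the chi-squared
distribution may be sketched in Python as follows. For simplicity the sketch uses two
PC scores, for which the chi-squared quantile has a closed form; the function names, the
mean, and the covariance values are assumptions made for this example.

```python
import math
import random


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a 2-D normal model."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Uses the inverse of the 2x2 covariance: (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_quantile_2df(coverage):
    # For 2 degrees of freedom the chi-squared CDF is 1 - exp(-q / 2).
    return -2.0 * math.log(1.0 - coverage)


def inside_ellipse(point, mean, cov, coverage=0.90):
    return mahalanobis_sq(point, mean, cov) <= chi2_quantile_2df(coverage)


# Hypothetical cluster model: standard deviation 2 for PC1 and 1 for PC2.
mean = [0.0, 0.0]
cov = [[4.0, 0.0], [0.0, 1.0]]

# Draw a random sample from the model itself (no original data are needed)
# and count how often a draw falls inside the 90% ellipse.
rng = random.Random(0)
draws = [[rng.gauss(0.0, 2.0), rng.gauss(0.0, 1.0)] for _ in range(20000)]
fraction_inside = sum(inside_ellipse(p, mean, cov) for p in draws) / len(draws)
print(round(fraction_inside, 2))
```

The axes of the ellipse scale with the standard deviations of the PC scores. With three
PC scores the same test applies with the three-degree-of-freedom quantile (about 6.25
for 90% coverage), which would be taken from tables or a statistics library.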

In summary, according to the ellipsoid model, the invention provides a
method comprising the steps of:

(a) subjecting a plurality ("m") of DNA samples from a first of "n"
defined states of a tissue of interest (e.g., samples of normal prostate tissue from "m"
different individuals) each to Fourier transform-infrared (FT-IR) spectroscopy to
produce FT-IR spectral data;

(b) independently analyzing the FT-IR spectral data from each
sample of step (a) by principal components analysis (PCA) to provide a plurality ("o")
of principal component (PC) scores (i.e., PC1, PC2, PC3 ... PCo scores) from each of
the "m" FT-IR spectra, every sample being characterized by an identical number of PC
scores as obtained by the identical treatment of the FT-IR spectral data, to provide "m"
sets of PC scores, each set containing "o" values;

(c) applying cluster analysis to the set of PC scores from the "n"
defined states of the tissue of interest (i.e., to all of the PC1 to PCo scores obtained from
the FT-IR spectra of the "m" samples of DNA) as obtained from all of the samples, to
identify outlier and non-outlier tissue samples;

(d) generating an equation defining a multivariate version of a
normal bell-shaped curve which best fits the non-outlier PC1 ... PCo values for all of the
samples in the first defined state;

(e) repeating steps (c) and (d) for each of the sets of PC scores
obtained from step (b), to define a set of "n" equations, each of the "n" equations
defining a multivariate version of a normal bell-shaped curve corresponding to each of
the "n" sets of PC scores;

(f) applying multivariate discriminant analysis to the "n" equations
defining multivariate versions of normal bell-shaped curves of step (e), to define a
probability equation for each of the "n" defined states of the tissue of interest.

According to the procedure outlined above (steps (a) through (f)), a
probability equation is generated corresponding to each defined state of interest for a
particular tissue of interest, where in combination these "n" probability equations define
a model.

A sample of tissue of interest having an unknown defined state is then
analyzed by FT-IR, and the spectral data obtained thereby are subjected to principal
components analysis to define "o" PC scores. These "o" PC scores are then "plugged
into" each of the "n" probability equations corresponding to the various defined states
within the model for the same tissue of interest, to provide a number ("n") of probability
scores corresponding to the number of defined states from which the model was
constructed. A probability score is thus obtained for each of the defined states of the
model. A higher probability score indicates a higher likelihood that the tissue of
interest is properly characterized by the defined state corresponding to the probability
equation. For example, if plugging the PC scores into the probability equation
corresponding to normal tissue provides a probability score of "w", and if plugging
those same PC scores into the probability equation corresponding to metastatic cancer
provides a probability score of "x", and "x" < "w", then the sample is more likely to be
normal tissue than metastatic cancer.

Thus, the invention further provides a method comprising the steps of:
(1) performing steps (a) through (f) above, to provide a model
comprising a number "n" of probability equations corresponding to a number "n" of
defined states for a particular tissue of interest;
(2) performing steps (g) through (j), as follows:

(g) subjecting a DNA sample from a tissue of interest having an
unknown defined state to Fourier transform-infrared (FT-IR) spectroscopy to produce
FT-IR spectral data;

(h) analyzing the FT-IR spectral data of step (g) by principal
components analysis (PCA) to provide a plurality ("o") of principal component (PC)
scores (i.e., PC1, PC2, PC3 ... PCo scores), to provide a set of "o" PC scores;

(i) "plugging in" the set of "o" PC scores of step (h) into each of the
"n" probability equations which compose the model of step (f) to obtain a probability
score corresponding to each of the "n" defined states; and

(j) comparing the "n" probability scores from step (i) to one another
in order to determine the most likely defined state of which the tissue having an
unknown defined state is a member.

As seen in Figures 18, 19, 20 and 21, the ellipsoids overlap. In fact, the
full models for these two or three clusters overlap everywhere. In other words, for any
given location in the three-dimensional space, there is a probability that the spectrum
for that point belongs to, e.g., RMT, another probability that it belongs to IDC, and
another probability that it belongs to IDCm. However, each group (IDC, IDCm and
RMT) has greater density at some locations than others. A given sample is
assigned to the group that has the greatest density at the location (PC scores) of the
sample. Therefore, even where the 90% IDC ellipsoid is buried inside the 90% IDCm
ellipsoid, the IDC is likely to have greater density at much or most of these interior
points. Thus, a sample that provides PC data that occur within this overlapping space
is more likely to be an IDC.
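
By way of illustration only, the assignment of a sample to the group having the greatest
density at the location of its PC scores may be sketched in Python as follows. The
sketch uses two PC scores and equal prior weights for the groups; the function names,
means, and covariances are hypothetical values chosen so that a tight IDC cluster lies
inside a broad IDCm cluster.

```python
import math


def mvn_density_2d(point, mean, cov):
    """Density of a 2-D multivariate normal model at `point`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * d2) / (2.0 * math.pi * math.sqrt(det))


def classify(point, models, priors=None):
    """models: {state: (mean, cov)}. Returns (most likely state, probabilities)."""
    if priors is None:
        priors = {state: 1.0 / len(models) for state in models}
    weighted = {state: priors[state] * mvn_density_2d(point, mean, cov)
                for state, (mean, cov) in models.items()}
    total = sum(weighted.values())
    probs = {state: w / total for state, w in weighted.items()}
    return max(probs, key=probs.get), probs


# Hypothetical models (illustrative values only).
models = {
    "RMT": ([-4.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "IDC": ([0.0, 0.0], [[0.25, 0.0], [0.0, 0.25]]),
    "IDCm": ([0.5, 0.0], [[4.0, 0.0], [0.0, 4.0]]),
}

inner_state, inner_probs = classify([0.2, 0.1], models)   # inside both ellipses
outer_state, outer_probs = classify([3.0, 0.0], models)   # outside the IDC cluster
print(inner_state, outer_state)
```

A point in the overlapping interior is assigned to the tighter IDC model because that
model concentrates more density there, whereas a point far from the IDC center is
assigned to the broader IDCm model.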

In general, the ellipsoid model of the present invention allows
construction of a model to represent normal, IDC and IDCm spectra/tissue. After
obtaining PC scores as described above, the correlation and diversity of PC scores are
determined. Selected data are then fit to a statistical model with the same correlations
and diversities, based on a multivariate version of the bell-shaped curve. The model can
be represented by ellipsoids containing an estimated 90% of the populations of each
group.

The present invention allows for a prediction of the transformation of
breast tissue. According to the ellipsoid model, PC scores from a sample of breast
tissue may be used to calculate three probabilities: probability that the tissue is normal,
probability that the tissue is IDC, and probability that the tissue is IDCm. The tissue is
assigned to the group that gives it the highest probability. In fact, using the ellipsoid
model, normal tissue was correctly identified 89% of the time (16 of 18 samples) while
cancer tissue was correctly identified 97% of the time (31 of 32 samples). In addition,
using the ellipsoid model, primary (IDC) cancer was correctly identified 100% of the
time (10 of 10 samples) while metastatic (IDCm) cancer was correctly identified 82% of
the time. Thus, the ellipsoid model is particularly well suited for correctly classifying
and differentiating primary cancer tissue (correctly identified 97% of the time) and
metastatic cancer (correctly identified 82% of the time).

The present invention analyzes DNA samples by PCA of FT-IR spectral
data and shows surprisingly that the direction of the progression of non-cancerous
("normal") DNA to non-metastatic tumor ("primary tumor") DNA differs significantly
from the direction of the progression of primary tumor to metastatic tumor. By
comparison of PCA of FT-IR spectra for a DNA sample of interest to PCA of FT-IR
spectra for DNA samples from known non-cancerous, non-metastatic tumor and
metastatic tumor samples, one may determine whether the sample of interest is in one of
these three states or progressing toward one of the tumor states.

For example, the present invention provides methods for the detection of
prostate cancer. The present invention applies technology employing principal
components analysis (PCA) of Fourier-transform infrared (FT-IR) spectroscopy
(PCA/FT-IR technology) to DNA derived from the normal prostate, benign prostatic
hyperplasia (BPH) and adenocarcinoma. As described in detail below, clusters of
points representing DNA from each of these tissues were almost completely separated
in two-dimensional plots of principal components (PC) scores. This indicates that
significant and specific structural modifications in DNA occur in the progression of
normal tissue to BPH and normal tissue to prostate cancer, and that the modifications
are unique for each of the two progressions. The structural alterations are reflected
primarily in spectral regions representing vibrations of the nucleic acids, phosphodiester
and deoxyribose structures. The separation and classification of the normal prostate
versus BPH or adenocarcinoma is shown using logistic regression models of infrared
spectra. Similarly, logistic regression models of DNA spectra are used herein to
evaluate the relationship between BPH and prostate cancer.

In the present characterization of DNA from prostate tissue,
wavenumber-absorbance relationships of infrared spectra analyzed by principal
components analysis (PCA) are expressed as points in space. Each point represents a
highly discriminating measure of DNA structural modifications that altered vibrational
and rotational motion of functional groups of DNA, thus changing the spatial
orientation of the points. Application of PCA/FT-IR technology to prostate tissue
provides a virtually perfect separation of clusters of points representing DNA from
normal prostate tissue, BPH and adenocarcinoma (prostate cancer). The progression of
normal prostate tissue to BPH and to prostate cancer appears to involve structural
alterations in DNA that are distinctly different. Models based on logistic regression of
infrared spectral data are used to calculate the probability of a tissue being BPH or
adenocarcinoma. Remarkably, the models have a sensitivity and specificity of 100%
for classifying normal versus cancer and normal versus BPH, and close to 100% for
BPH versus cancer. Thus, the present invention shows that PCA/FT-IR technology is a
powerful means for discriminating between normal prostate tissue, BPH and prostate
cancer, with applicability for risk prediction and clinical application.

Although it is likely that the most popular use of the invention may be to
assess the health of an individual organism with respect to cancer, it will be evident to
those in a variety of arts that there are other uses. For example, the invention permits
the analysis of environmental hazards. By analyzing DNA (as described herein) of an
organism after exposure to an environment of unknown genotoxicity and comparing
that profile to one obtained from DNA of the organism prior to its introduction to the
environment (or comparing to an organism in a nonpolluted environment), an
assessment of the genotoxicity of the environment can be made. In a preferred
embodiment, the species of the organism in a nonpolluted environment is identical to
that of the organism in the environment of unknown genotoxicity. As used herein, the
term "nonpolluted environment" includes an environment without any chemical
contamination or one in which a specific pollutant or pollutants are absent.

Importantly, the examples show that the use of the FT-IR/statistics
technology has considerable promise for identifying structural alterations in DNA prior
to the manifestation of transformed cells. These alterations can be used to establish
disease probability models having potentially wide application in biology and medicine.

Other applications

The FT-IR/statistics technology described herein focuses on biological
systems in which changes in DNA structure are known to play, or are suspected of
playing, an important role in the development of disease. Notable examples to which
the methods of the present invention may be directed include various forms of cancer
(refs. 2, 4-6, 8, 9), Alzheimer's disease (ref. 10), diabetes mellitus (ref. 11), heart disease
(ref. 12) and Parkinson's disease and other neurodegenerative disorders (ref. 13). DNA
changes are also potentially important in the putative relationship between
electromagnetic fields and cancer (ref. 14), infertility (ref. 15), radiation effects (ref.
16), aging (ref. 17), pharmacokinetic evaluations of drugs (ref. 18) and genetic
alterations in cultured cells (ref. 14). Moreover, studies linking oligonucleotides having
different base arrangements to their corresponding spectral properties, as revealed by
statistical models, may be used to expand the scope of the technology in understanding
genetic alterations.

The following examples are offered by way of illustration and not by
way of limitation.

EXAMPLES 

In the Examples, the analysis of the data was according to the centroid
(also called the "sigmoid") model. However, the data acquisition and characterization
in terms of PC scores and cluster analysis would be the same for the ellipsoid model. In
the ellipsoid model, the "inlier" PC scores (as identified by cluster analysis) would be
fitted to a multivariate normal distribution, which is essentially a multivariate
generalization of the normal (Gaussian) bell-shaped curve, and then the various
equations describing the bell-shaped curves as obtained from a certain tissue type would
be subjected to discriminant analysis to provide probability equations. Commercially
available statistical programs, e.g., SAS, can generate the appropriate models, and
perform the necessary discriminant analysis, if the raw data (PC scores) are provided.
As more data become available, the SAS program will generate more accurate
probability equations. The SAS program will also be able to receive PC scores from a
sample having an unknown defined state, and then "plug" these values into the
probability equations to provide probability scores for the sample having a given defined
state. Many statistics textbooks also provide descriptions of discriminant analysis and
the construction of multivariate normal bell-shaped curves.
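By way of illustration only, the ellipsoid model outlined above might be sketched as follows. The sketch uses Python with NumPy rather than SAS. The function names (fit_mvn, log_density, posterior), the equal prior probabilities and the made-up PC scores are assumptions introduced here, not part of the disclosed method.

```python
import numpy as np

def fit_mvn(scores):
    """Fit a multivariate normal (mean vector, covariance matrix) to PC scores.

    scores: array of shape (n_samples, n_components) for one tissue type.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    return mean, cov

def log_density(x, mean, cov):
    """Log of the multivariate normal density at point x."""
    x = np.asarray(x, dtype=float)
    k = mean.size
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)

def posterior(x, models, priors=None):
    """Probability of each defined state given PC scores x (Bayes' rule)."""
    names = list(models)
    if priors is None:
        # Assumption: equal prior probability for every defined state.
        priors = {name: 1.0 / len(names) for name in names}
    logp = np.array([log_density(x, *models[n]) + np.log(priors[n]) for n in names])
    logp -= logp.max()  # stabilize the exponentials
    p = np.exp(logp)
    p /= p.sum()
    return dict(zip(names, p))

# Made-up PC scores (two components) for two hypothetical tissue types.
rng = np.random.default_rng(0)
primary = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
metastatic = rng.normal(loc=[3.0, 1.0], scale=0.8, size=(30, 2))
models = {"primary": fit_mvn(primary), "metastatic": fit_mvn(metastatic)}

# "Plug" the PC scores of a sample of unknown state into the probability equations.
probs = posterior([2.9, 1.1], models)
```

Because each tissue type keeps its own covariance matrix, this corresponds to a quadratic form of discriminant analysis; a pooled covariance matrix would give the linear form.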

Figure 14 provides a picture and schematic diagram of a FT-IR
microscope spectrometer (System 2000, Perkin-Elmer Corp., Norwalk, CT) and its use
for elucidating DNA structure. DNA (10-15 µg), extracted from a split tissue, is
lyophilized. The dry, fluffy DNA is rolled out on a microscope slide, forming a thin,
transparent film that is peeled off with a scalpel and placed onto the BaF2 window. The
microscope is focused on the film when the visible beam is introduced in-path.
Inserting the aperture, ten uniform areas of diameter > 100 µm are chosen. The infrared
beam is switched in-path and focused through each area, scanning between 2000 and
700 cm⁻¹ after a background scan on the BaF2 window. The interferogram recorded in
the detector is Fourier-transformed to an absorbance spectrum. Each spectrum is
baselined (the mean absorbance across 11 wavenumbers, centered at the minimum
absorbance between 2000 and 1700 cm⁻¹, is subtracted from the total absorbances) and
then normalized (the entire baselined spectral absorbances are divided by the mean
between 1750 and 700 cm⁻¹) to adjust for the sample's optical characteristics (e.g.,
related to film thickness). These procedures can be carried out with simple functions in
the S-PLUS statistical package (Mathsoft Corp., Analysis Products Division, Seattle,
WA). Ultimately, a grand mean is obtained for the DNA of one type of tissue (e.g.,
healthy prostate) which can be compared statistically to that of another type of tissue
(e.g., prostate cancer) (4). (Fig. 14A) Two overlaid grand mean spectra. Absorbance
values between 1700 and 1450 cm⁻¹ are assigned to C-O stretching and NH2 bending
vibrations, and 1450-1300 cm⁻¹ to NH vibrations and CH in-plane deformations of
nucleotide bases. The antisymmetric stretching vibrations of the PO2⁻ structure occur at
≈1240 cm⁻¹ and vibrations of deoxyribose are generally assigned to absorbance values
between 1150 and 950 cm⁻¹ (6). (Fig. 14B) P-values obtained for each wavenumber
using the unequal variance t-test. P-values < 0.05 (shown in the regions 1590-1510
cm⁻¹ and 1060-1010 cm⁻¹) are evidence for a spectral/structural difference between the
DNA samples.
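The baselining and normalization steps described above might be sketched as follows. This is an illustrative Python version, not the S-PLUS functions referred to in the text. The function name baseline_and_normalize, the integer wavenumber grid and the synthetic spectrum are assumptions. The sketch also assumes that the 11-point window is simply clipped where it would run past the recorded range.

```python
def baseline_and_normalize(wavenumbers, absorbances):
    """Baseline and normalize one spectrum.

    wavenumbers: integer wavenumbers (cm^-1), one per absorbance value.
    absorbances: raw absorbance values at those wavenumbers.
    Returns a dict mapping wavenumber -> normalized absorbance.
    """
    pairs = dict(zip(wavenumbers, absorbances))

    # Baseline: find the minimum absorbance between 2000 and 1700 cm^-1,
    # average the 11 wavenumbers centered on it, and subtract that mean.
    window = [w for w in pairs if 1700 <= w <= 2000]
    w_min = min(window, key=lambda w: pairs[w])
    neighbors = [pairs[w] for w in range(w_min - 5, w_min + 6) if w in pairs]
    base = sum(neighbors) / len(neighbors)
    baselined = {w: a - base for w, a in pairs.items()}

    # Normalize: divide by the mean baselined absorbance between 1750 and 700 cm^-1.
    region = [a for w, a in baselined.items() if 700 <= w <= 1750]
    mean_region = sum(region) / len(region)
    return {w: a / mean_region for w, a in baselined.items()}

# Synthetic, made-up spectrum: flat background of 0.2 plus one broad band.
wn = list(range(700, 2001))
raw = [0.2 + 0.5 * max(0.0, 1.0 - abs(w - 1240) / 200.0) for w in wn]
norm = baseline_and_normalize(wn, raw)
region_mean = sum(norm[w] for w in range(700, 1751)) / 1051
```

After normalization the mean absorbance over 1750-700 cm⁻¹ equals 1.0 for every spectrum, which is what allows differences between spectra to be read as percentages.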






EXAMPLE 1
Prostate Cancer 

A. Tissue Acquisition, DNA Isolation and PCA/FT-IR Spectral
Analysis: After excision, each tissue was flash frozen in liquid nitrogen. All tissues
were kept at -80°C prior to use and DNA was maintained under an atmosphere of pure
nitrogen during the extraction procedure to avoid oxidation. DNA was isolated from
the tissues and aliquoted for FT-IR spectroscopy (about 20 µg). Each DNA sample was
completely dried by lyophilization, purged with pure nitrogen, and stored in an
evacuated, sealed glass vial at -80°C. A total of 31 tissue samples were used. Five
samples of prostate tissue obtained from individuals who died by accidents were
examined histologically and found to be normal. These served as controls. Eighteen
samples of benign prostatic hyperplasia (BPH) and eight samples of adenocarcinoma
(cancer) served as test samples, each comprising a portion of the histologically
identified lesion. All samples were obtained from the Cooperative Human Tissue
Network, Cleveland, OH, together with related pathology data.

The IR spectra were obtained using the Perkin-Elmer System 2000
equipped with an I-series microscope (The Perkin-Elmer Corp., Norwalk, CT). For
PCA/FT-IR spectral analysis, each spectrum was normalized across the range of 1750
to 700 cm⁻¹, as described above. This yielded a relative absorbance value for each
wavenumber, with a mean of 1.0. Euclidean distance was used to define the difference
between a pair of spectra either for the entire spectrum or for a sub-region. This
standard distance measure is defined as the square root of the sum of squared
absorbance differences between spectra at each of the wavenumbers considered (e.g.,
1051 for the entire spectral region 1750-700 cm⁻¹). The Euclidean distance can also be
expressed in a more descriptive form as a percent. The numerator of the percent is the
Euclidean distance divided by the square root of the number of wavenumbers for a
region. The denominator used here for the percent for any region is the mean
normalized absorbance between 1750-700 cm⁻¹, which is 1.0 for every case.
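The percent form of the Euclidean distance might be computed as in the following sketch. The function name percent_distance and the two constant test spectra are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def percent_distance(spec_a, spec_b, region_mean=1.0):
    """Euclidean distance between two normalized spectra, expressed as a percent."""
    if len(spec_a) != len(spec_b):
        raise ValueError("spectra must cover the same wavenumbers")
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))
    # Numerator: distance divided by sqrt(number of wavenumbers), i.e. the
    # root-mean-square difference. Denominator: mean normalized absorbance (1.0).
    rms = euclid / math.sqrt(len(spec_a))
    return 100.0 * rms / region_mean

# Two made-up spectra over 1051 wavenumbers that differ by 0.1 everywhere.
a = [1.0] * 1051
b = [1.1] * 1051
d = percent_distance(a, b)  # approximately 10 (percent)
```

Dividing by the square root of the number of wavenumbers makes the distance comparable between the full region and sub-regions of different widths.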

Principal components (PC) analysis (PCA) was used to identify a few
variables (components) that capture most of the information in the original, long list of
variables (the spectral absorbances at each wavenumber). This reduction in the number
of variables is analogous to the process in educational testing whereby many individual
test scores, such as in reading and arithmetic, are combined into a single academic
performance score. Four PC scores (e.g., four dimensions) were found to be sufficient
to describe the 1051 dimensions of the normalized spectra. PC scores were calculated
with the grand mean of all spectra subtracted from each spectrum. The nonparametric
Spearman correlation coefficient was used to assess the association of PC scores with
patient ages and Gleason scores. The nonparametric analysis was used because some of
the distributions are skewed or are not normal ("bell-shaped"), which can lead to a bias
in statistical significance when estimated from the Pearson correlation coefficient.
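The reduction of 1051 absorbance values to a few PC scores might be sketched as follows, using a singular value decomposition in NumPy. The function name pca_scores and the simulated spectra are assumptions. The text does not state which numerical routine was actually used.

```python
import numpy as np

def pca_scores(spectra, n_components):
    """Return PC scores, eigenvectors, explained-variance fractions and grand mean.

    spectra: array (n_spectra, n_wavenumbers) of normalized absorbances.
    """
    X = np.asarray(spectra, dtype=float)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean  # subtract the grand mean spectrum
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = centered @ vt[:n_components].T
    return scores, vt[:n_components], explained[:n_components], grand_mean

# Made-up data: 29 spectra over 1051 wavenumbers built from 2 latent patterns.
rng = np.random.default_rng(1)
patterns = rng.normal(size=(2, 1051))
weights = rng.normal(size=(29, 2))
spectra = 1.0 + 0.05 * (weights @ patterns) + rng.normal(scale=0.001, size=(29, 1051))

scores, vecs, explained, grand_mean = pca_scores(spectra, n_components=4)
```

Here scores holds four numbers per spectrum, and explained gives the fraction of total variation carried by each component, the quantity that would be summed to check a threshold such as 90%.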

Two cases, which were outliers, were omitted from these analyses,
leaving 29 cases. The omitted BPH sample and the omitted cancer sample had spectra
very different from the included cases. Their Euclidean distances from the most similar
spectra were 52% and 41%, respectively. All other spectra differed from their "nearest
neighbor" spectrum by at most 21%, with a majority of spectra differing by less than
11%. The two outlier spectra show drastically reduced absorbance in the region around
1650 cm⁻¹, representing vibrations of the nucleic acids.

The Kruskal-Wallis and Mann-Whitney tests were used to determine if
the three groups had similar diversity, defined as the mean distance of a spectrum to its
group centroid. A permutation test was used to determine whether the three groups
tended to cluster separately (representing an internal similarity of spectral properties in
a group). The distance of each spectrum to its nearest neighbor in its own group (either
normal, BPH, or cancer) was calculated, and the mean of these nearest neighbor
distances for all of the spectra was the test statistic. The test was carried out by
randomly permuting group membership labels 10* times and recalculating the test
statistic each time. A smaller observed distance to the nearest neighbor than that
obtained by random relabelling of groups is an indication of clustering. A
nonparametric, rank-based version of this test was carried out by expressing each
distance as a rank. For each spectrum, the distances to other spectra were ranked and
the permutation test was carried out as described above, but with distances replaced by
ranks. The test statistic was a mean rank. Again, a smaller observed mean rank than
the mean obtained from random permutation is an indication of clustering. Both the test
using distance and the test using ranks were carried out for the entire spectrum, 1750 -
700 cm⁻¹, and for several subregions.
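The distance-based permutation test described above might be sketched as follows. The function names, the number of permutations (500), the add-one correction in the P-value and the two-dimensional toy points are all assumptions made for illustration. Every group must contain at least two members for a same-group nearest neighbor to exist.

```python
import math
import random

def mean_nn_distance(points, labels):
    """Mean distance from each point to its nearest neighbor with the same label."""
    total = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, other) in enumerate(zip(points, labels))
                if j != i and other == lab]
        total += min(same)
    return total / len(points)

def permutation_p_value(points, labels, n_perm=1000, seed=0):
    """One-sided P-value: share of random relabelings with a statistic <= observed."""
    rng = random.Random(seed)
    observed = mean_nn_distance(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly permute group membership labels
        if mean_nn_distance(points, shuffled) <= observed:
            hits += 1
    # Add-one correction (an assumption) so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Made-up data: three well-separated groups of six points each.
rng = random.Random(42)
centers = {"normal": (0.0, 0.0), "BPH": (5.0, 0.0), "cancer": (0.0, 5.0)}
points, labels = [], []
for name, (cx, cy) in centers.items():
    for _ in range(6):
        points.append((cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)))
        labels.append(name)

observed, p = permutation_p_value(points, labels, n_perm=500)
```

The rank-based version would follow the same pattern, with each distance replaced by its rank among the distances from that spectrum to all other spectra.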

Finally, logistic regression analysis was used as a model to determine if
PC scores could be used to discriminate between pairs of DNA groups (normal versus
BPH, normal versus cancer and BPH versus cancer). The logistic regression analysis
yields a risk score, which is a linear combination of PC scores, and a predicted
probability of a sample being in one of the two groups considered (e.g., the probability
of being BPH when BPH is compared to normal). These predicted probabilities, along
with a chosen probability cut point, can be used to classify samples and provide
estimates of sensitivity and specificity, or percent of samples correctly classified. For
each analysis a cut point was chosen that jointly maximized sensitivity and specificity.
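The conversion of a risk score to a predicted probability, and the choice of a cut point, might be sketched as follows. The standard logistic transform exp(R)/[1 + exp(R)] is used. The intercept, coefficients, PC scores and group labels are hypothetical values, not those of the fitted models. Reading "jointly maximized" as maximizing the sum of sensitivity and specificity is also an assumption.

```python
import math

def risk_to_probability(r):
    """Logistic transform of a risk score R: exp(R) / (1 + exp(R))."""
    return 1.0 / (1.0 + math.exp(-r))

def best_cut_point(probs, is_positive):
    """Pick the cut point maximizing sensitivity + specificity (assumed criterion)."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    best = None
    for cut in sorted(set(probs)):
        # A case is called positive when its probability is >= the cut point.
        tp = sum(1 for p, y in zip(probs, is_positive) if p >= cut and y)
        tn = sum(1 for p, y in zip(probs, is_positive) if p < cut and not y)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (cut, sens, spec)
    return best

# Hypothetical intercept and PC coefficients (illustration only).
intercept, coefs = -1.0, (2.0, -0.5)
pc_scores = [(1.5, 0.2), (2.0, -1.0), (0.9, 0.1), (-0.5, 0.3), (-1.0, 1.0), (0.1, 0.8)]
truth = [True, True, True, False, False, False]

# Risk score: linear combination of PC scores plus the intercept.
risks = [intercept + sum(c * s for c, s in zip(coefs, row)) for row in pc_scores]
probs = [risk_to_probability(r) for r in risks]
cut, sens, spec = best_cut_point(probs, truth)
```

With these made-up values the two groups separate completely, so a cut point exists at which both sensitivity and specificity equal 1.0.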

B. Clustering in PC Plots: PCA/FT-IR spectral analysis yielded
four components (four PC scores per case) which explained a total of 90% of the
spectral variation over 1051 wavenumbers. That is, most of the features of the 29
spectra could be described by four PC scores (labeled PC1, PC2, PC3, PC4). The first
two PC scores explained 76% of the variation and were adequate for two-dimensional
representation (Figure 1). Figure 1 shows that the three groups were distinctly clustered.
The two outliers omitted from the analysis are also represented on this plot and appear
to the right of the main clusters.

The actual distance of the outlier points to other points is larger than that
shown in this two-dimensional plot due to differences represented by other dimensions.
The permutation test for clustering of groups (1750 - 700 cm⁻¹) yielded P = 0.1, based
on the distance measure, and P = 0.01 using the nonparametric ranking technique
(Table 1). The greater significance obtained by the ranking method arises from the
relative isolation of one or two cases from the core of their group (Figure 1), a
configuration which influences the distance measure more than the ranking measure.
Using these techniques, significant clustering was obtained for two regions of the
spectrum: 1174 - 1000 cm⁻¹ (assigned to strong stretching vibrations of the PO2⁻ and
C-O groups of the phosphodiester-deoxyribose structure) and 1499 - 1310 cm⁻¹
(assigned to weak NH vibrations and CH in-plane deformations of the nucleic acids).
The P-values for mean distance and mean rank for these regions ranged from 0.02 to <
0.001 (Table 1). The significance levels obtained strongly reject the null hypothesis
that the observed clustering of the three groups occurred by chance. Overall, the
findings indicate that DNA is altered in ways that produce clustering and, consequently,
discrimination between normal prostate, BPH and prostate cancer DNA (Figure 1;
Tables 1 and 2).

Detailed comparisons were made between the spectra of pairs of groups:
normal vs. cancer, normal vs. BPH and BPH vs. cancer. The statistical significance of
differences in mean normalized absorbance between groups was assessed for each
wavenumber between 1750 - 700 cm⁻¹, using the unequal variance t-test (Figure 2,
A-C). The plot shows the comparison of the mean spectrum for each of the two groups,
as well as the P-value from the t-test. The regions with P < 0.05 represent differences
between groups (e.g., normal vs. cancer) which are much less likely to be due to chance
than regions with P > 0.05. Each of the spectral comparisons between groups shows
statistically significant differences in areas of the spectrum assigned to vibrations of the
phosphodiester-deoxyribose structure and the nucleic acids. The spectral regions with
significant differences in absorbance for the phosphodiester-deoxyribose structure are
similar (≈1050 - 1000 cm⁻¹); however, absorbances associated with the nucleic acids
vary among the groups. That is, for the normal-cancer comparison, the region of
significant difference is primarily ≈1475 - 1400 cm⁻¹ (C=O stretching and NH bending
vibrations), whereas for the normal-BPH comparison it is ≈1600 - 1500 cm⁻¹. The
comparison for BPH-cancer is focused at ≈1500 cm⁻¹. For the normal-BPH and
BPH-cancer comparisons, significant differences are shown between ≈1175 to 1120
cm⁻¹, a region that likely includes symmetric stretching vibrations of the PO2⁻ group.
The difference in means at all of these spectral regions is apparent from the plots of
mean spectra per group in Figure 2. The structural modifications are pivotal in the
spatial distribution of points in the PC plot (Figure 1) and in the pronounced
discrimination between clusters (Table 1).

Table 1



Mean distance to nearest neighbor of same group and permutation test for non-random
clustering. Distance is expressed as a percent difference between spectra;
10^ permutations were performed for each spectral sub-region.

Spectral region   Mean distance*                        Mean rank†
(cm⁻¹)            observed   random        P-value      observed   random        P-value
                             permutation                           permutation
1750 - 700        12.2       12.8          0.1          2.0        3.0           0.01
1750 - 1500       12.3       12.3          0.5          2.4        3.0           0.09
1499 - 1310       5.9        6.5           0.02         1.6        3.0           <0.001
1309 - 1175       6.7        6.5           0.7          3.0        3.0           0.5
1174 - 1000       13^        15.0          0.02         2.0        3.0           0.01
999 - 700         6.9        7.4           0.1          2.3        3.0           0.05

*Mean Euclidean distance to nearest neighbor in the same group expressed as a percent.
†Mean rank of Euclidean distance of each spectrum to nearest neighbor in the same group.



C. Cluster Diversity: The diversity of the three groups, expressed as
the mean distance to the group centroid, did not differ significantly (P = 0.8). However,
the normal prostate group was slightly less diverse (mean distance = 11.7%) than was
the BPH group (mean distance = 14.5%) or prostate cancer group (mean distance =
13.9%). Increased structural diversity generated in primary tumors is likely an
important factor in selecting DNA forms that potentially give rise to malignant cell
populations.
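The diversity measure used here, the mean distance of the spectra in a group to the group centroid expressed as a percent, might be computed as in the following sketch. The function name diversity and the three four-point "spectra" are assumptions. The sketch also assumes that every spectrum has already been normalized to a mean absorbance of 1.0.

```python
import math

def diversity(spectra):
    """Mean percent distance of each spectrum in a group to the group centroid."""
    n, m = len(spectra), len(spectra[0])
    # The centroid is the mean spectrum of the group.
    centroid = [sum(s[k] for s in spectra) / n for k in range(m)]
    # Root-mean-square difference from the centroid, times 100 (mean absorbance is 1.0).
    dists = [100.0 * math.sqrt(sum((x - c) ** 2 for x, c in zip(s, centroid)) / m)
             for s in spectra]
    return sum(dists) / n

# Made-up group of three very short "spectra".
group = [[1.0, 1.0, 1.0, 1.0], [1.2, 0.8, 1.2, 0.8], [0.8, 1.2, 0.8, 1.2]]
value = diversity(group)  # distances of 0%, 20% and 20% average to about 13.3%
```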

D. Group Classification: PC scores can be readily used to classify
patients into groups when pairs of groups are compared using logistic regression. The
logistic regression model (Table 2) is an equation which yields a risk score, R, when the
values of the PC scores are inserted into the equation. R is transformed to a probability
by the following standard statistical equation: probability = exp(R)/[1 + exp(R)]. A cut
point is chosen and if the probability exceeds this cut point, the case would be classified
as BPH. The actual cut points are noted below. As shown in Table 2, the model for
normal versus cancer and normal versus BPH correctly classifies each group 100% and
100% overall (P-values in each case were <0.001). The correct classification rate for
cancer versus BPH was close to 90%, based on a designation of "cancer" for a predicted
probability of ≥0.1. (Probability cut-points of 0.15 to 0.41 achieve the same correct
classification rates in the BPH vs. cancer comparison.) The predicted probabilities
based on the models in Table 2 are given in Figure 3. The individual risk score is based
on the appropriate PC model (Table 2) and the predicted probability is a mathematical
function of the risk score, as noted above. All of the BPH and cancer cases have
predicted probabilities extremely close to 1.0 and all of the normal cases have predicted
probabilities of ≤0.002 when BPH or cancer are compared to normal cases. These
marked distinctions in predicted probabilities confirm the clear separation of groups, as
shown in Figure 1. When cancer is compared to BPH, predicted cancer probabilities
ranged from 0.42 to 1.00 and predicted BPH probabilities ranged from 0.00 to 0.65.

WO 99/00660    PCT/US98/13386

The two outliers omitted from the analyses tend to support the findings.
The outlier BPH and cancer points lie to the right in the PC plot (Figure 1). This is the
same direction found with the progressions from normal to BPH and from normal to
cancer, suggesting that the outlier DNAs have a higher degree of structural
modification. When the models shown in Table 2 were used to classify the two outliers,
the BPH outlier was correctly classified, using the normal versus BPH model, with a
predicted BPH probability close to 1.0. The cancer outlier is also correctly classified in
the normal versus cancer model with a predicted cancer probability close to 1.0. In the
BPH versus cancer model, the BPH outlier is correctly classified with a predicted
cancer probability close to zero; however, the cancer outlier is incorrectly classified as a
BPH with a cancer probability close to zero.

Table 2

Logistic regression models for probability of BPH (vs. Normal), Cancer (vs. Normal)
and Cancer (vs. BPH). Normal, n = 5; BPH, n = 17. P-values are based on the null
hypothesis that each model is not predictive of group membership. P-values are
calculated from a chi-square test on change in deviance.



Model               Intercept       Coefficients ± Standard Errors
                                    PC1            PC2           PC3           PC4
normal vs. BPH      24.9 ± 0.1      5.2 ± 0.2      5.8 ± 0.04    3.9 ± 0.03
normal vs. Cancer   34.3 ± 0.1      12.0 ± 0.04                                -21.0 ± 0.1
BPH vs. Cancer      -14.5 ± 8.1     -4.5 ± 2.6                                 -11.1 ± 6.3

                    Correct Classification Rate
Model               By Group                       Overall    P-Value*
normal vs. BPH      normal: 100%; BPH: 100%        100%       <0.001
normal vs. Cancer   normal: 100%; Cancer: 100%     100%       <0.001
BPH vs. Cancer      BPH: 88%; Cancer: 100%         92%        0.001

*P-value for the null hypothesis that the probability of a case falling into a specified group is unrelated to
the PC scores.






E. Age and Gleason Score Relationships: Age does not appear to be
a factor in creating the pronounced distinctions among groups, although the incidence
of prostate cancer increases significantly over the age of 50 years. The age ranges for
the three groups were 16 - 73 years for normal (n = 5); BPH, 58 - 73 (n = 17); and
cancer, 61 - 76 (n = 7). Among the Spearman correlations of age with each of the four
PC scores, none were statistically significant at P < 0.05. In all, 28 correlations were
considered, consisting of age correlated with each PC score in each of the three groups,
as well as in all pairs of groups (e.g., age correlated with each PC score in normal and
BPH tissue combined) and in the entire pooled set of 29 cases. Spearman correlations
ranged in magnitude from 0.01 to 0.59 with P = 0.09 to P = 1.0. The most significant
correlation was r = -0.51 between age and PC4 in the combined normal and cancer
groups (P = 0.09). When PC4 was omitted from the logistic regression analysis and
models were based on PC1 - PC3, the P-values corresponding to those in Table 2 were,
top to bottom, P < 0.001, P < 0.001 and P = 0.005, again supporting a non-random
distinction among the groups. These results based on PC4 and the weak or
nonsignificant correlations between age and other PC scores do not support any role for
age in the ability to use spectra to distinguish among the groups.

The Gleason score, which uses microscopically evinced architectural
changes to classify tumor status, had little association with the PC scores, although
based on the n = 7 cancer cases, there was limited power to detect other than strong
associations. Spearman correlations of PC scores 1 - 4 with the Gleason score ranged
from -0.49 to +0.26, with P = 0.2 to 0.8.

F. Logistic Regression Models of Probability: The sigmoid curves
(Figure 3) for the prostate show sharp transitions between the normal and cancer states
and normal and BPH states. These transitions are characterized by a lack of cases at
intermediate probabilities, corresponding to the clear separation of groups in Figure 1.
Thus, at some point in the modification of DNA, critical structural changes apparently
take place that lead to a rapid increase in cancer probability.

BPH is not known to be etiologically related to prostate cancer; however,
it is of interest that the BPH versus prostate cancer curve (Figure 3C) shows several
cases having intermediate probabilities. The configuration of cases in Figure 1 also
provides some insight into the controversial view that BPH is a direct precursor of
prostate cancer. The findings do not support this concept in that the BPH group lies
"beyond" the cancer group, starting from the normal group. This positioning suggests
that a transition from BPH to cancer would involve a reversal of some of the spectral
transitions shown to be associated with cancer, or that there are additional changes in
the BPH DNA that mimic a reversal in the progression to cancer. Alternatively,
modifications may result in DNA structures that lead to a variety of nonneoplastic
lesions, including BPH. Although BPH may not be a direct precursor of prostate
cancer, PCA/FT-IR spectral analysis may provide a promising means of predicting the
occurrence of prostate cancer, based on the structural status of BPH DNA.

The absence of transition states in the normal to cancer and normal to
BPH curves is of interest. This is likely due to the fact that "transition" tissues having
DNA values between zero and 100% probability (Figure 3, A-C) were not part of this
study.

Evidence with the prostate suggests that DNA structure is progressively
altered in response to factors in the microenvironment, notably the •OH, that are likely
etiologically related to the development of cellular lesions, prostate tumors
(adenocarcinoma) and BPH. Intervention to forestall or correct the genetic instability of
these tissues and likely increase in cancer risk should focus on controlling the cellular
redox status and •OH concentrations. The approaches may include control of the
iron-catalyzed conversion of H2O2 to the •OH (Imlay et al., Science 240:640-642, 1988);
regulation of •OH production resulting from redox cycling of hormones (Han and Liehr,
Carcinogenesis 16:2571-2574, 1995) and environmental xenobiotics (Bagchi et al.,
Toxicology 104:129-140, 1995); and antioxidant/reductant therapy (Ames et al., Proc.
Natl. Acad. Sci. USA 90:7915-7922, 1993; Bast et al., Am. J. Med. 91(Suppl. 3C):2S-
13S, 1991).

EXAMPLE 2

Breast Cancer

A. Tissue Acquisition, DNA Isolation and PCA/FT-IR Spectral
Analysis: Tissues were obtained from local Seattle hospitals and The Cooperative
Human Tissue Network (Cleveland, OH). A total of 12 tissues were obtained from 12
patients with invasive ductal carcinoma of the breast but having no lymph node
involvement (IDC), of which one was multifocal (the second focus being a signet ring
cell carcinoma, which was not evaluated) and one was bilateral breast cancer (only one
of which was evaluated). A total of 25 tissues were obtained from 25 patients with
invasive ductal carcinoma having one or more lymph nodes positive for metastatic
cancer (IDCm). No unusual histologies occurred among the non-metastatic and
metastatic groups with the exception of the two IDCs mentioned. Tumor size was
based on the maximum dimension of the tumor, as recorded in the pathology reports.
Non-cancerous breast tissue (RMT) was obtained from 21 patients who had undergone
hypermastia surgery (reduction mammoplasty). Routine pathology showed no cellular
changes other than occasional non-neoplastic (e.g., fibrocystic) lesions in these tissues.

After excision, each tissue was flash frozen in liquid nitrogen and stored
at -80°C. DNA was isolated from the tissues, dissolved in deionized water, and
aliquoted for FT-IR spectroscopy (~20 µg). Each DNA sample was completely dried by
lyophilization, purged with pure nitrogen, and stored in an evacuated, sealed glass vial
at -80°C. All samples were analyzed by FT-IR spectroscopy.

The IR spectra were obtained using The Perkin-Elmer System 2000
equipped with an I-series microscope (The Perkin-Elmer Corp., Norwalk, CT). Each
spectrum was specified by the absorbance at each integer wavenumber from 2000 to
700 cm⁻¹. Only the interval from 1750 to 700 cm⁻¹, which included all major variations
among spectra, was included in this analysis. A baseline adjustment and normalization
was carried out. One RMT was represented by two sections. The mean of the two
adjusted and normalized spectra was used in these analyses. The multiplicative
normalizing factor was applied to absorbances between 1750 and 700 cm⁻¹. Using
deuterium exchange, no evidence was found to suggest that absorbed moisture
contributed to the spectral properties of DNA.

B. Statistical Analysis: For analysis of overall DNA structure
employing FT-IR analysis, Principal Components Analysis (PCA) was used. PCA
methodology is a statistical procedure applied to a single set of variables with the aim of
discovering a few variables (components) that are independent of each other and which
capture most of the information in the original, long list of variables. The methodology
can greatly reduce the number of variables of concern. The PCs partition the total
variance by finding the first PC (a linear combination of the variables) which accounts
for the maximum amount of variance for the entire population. The PCA methodology
then finds a second combination, independent of the first PC, such that it accounts for
the next largest amount of variance. This procedure continues until a number of
independent PCs are found that explain a significant portion of the total variance. In
the present context, PCA was a way to identify major features of
absorbance-wavenumber variation across a collection of spectra and describe that
variation succinctly.

Using PCA, it is possible to identify a few components that serve as
"building blocks" for the spectra. After the PCA, each spectrum can be represented by a
few PC scores. PCA was carried out with the grand mean spectrum subtracted from
individual spectra. Prior to the analysis, it was decided to retain enough components to
explain at least 90% of the total variation (around the mean) of the data set. To
determine if some of the differences among spectra might be due to age, the correlation
between age and each PC score was calculated. To visualize the spectral relationship of
the cancer and non-cancer groups (IDCm, IDC and RMT), plots were constructed based
on their first three PC scores. These two and three dimensional plots permit the
simultaneous examination of two or three of the most significant components of any
single specimen data set and permit the meaningful comparison of each data set to one
another.

C. Principal Components Analysis of Spectral Profiles: Spectral
profiles revealed great diversity of the IDCm group and homogeneity of the IDC group.
Figure 4 shows a three-dimensional representation of the spectra based on PCA. The
position in this plot is determined by the absorbance spectrum, mainly expressed as the
height, width and location of peaks. There is a core cluster of IDCs in the upper part of
the plot (indicated by yellow spheres). The two IDCs in the lower left part of the plot
are outliers well removed from the core cluster. Notably, these are: 1) an IDC with a
second focus of signet ring cell carcinoma and 2) a bilateral breast cancer. As apparent
from the plot, both the IDCm cluster (magenta) and the RMT cluster (blue) are
considerably larger (indicating greater spectral diversity) than the core IDC cluster.

The size of a cluster can be measured and its spectral diversity
represented by the mean distance of the members from the centroid of the cluster. This
distance can be expressed as an approximate percent difference in normalized
absorbance per wavenumber between a cluster member and the mean spectrum for the
cluster, which lies at its centroid. The distance expressed as a percent difference is
calculated as: a) 100% times the square root of the mean squared difference in
normalized absorbance across wavenumbers 1750 to 700 cm⁻¹, which is then b) divided
by 1.0, the approximate mean normalized absorbance for most spectra. For the
comparison of cluster sizes, three RMTs, three IDCms and two IDCs that lay at outlier
distances from the centroid in each group were removed to define a core cluster for the
RMT, IDCm, and IDC. All outliers had at least a 20% difference from any member of
their cluster. Based on centroids and distances of the remaining cases, the spectral
diversity (mean distance from the centroid) was 12.4% for the IDCm group, 7.3% for the
IDC group, and 9.2% for the RMT group. An approximate P-value for the difference in
diversity between groups was based on the Mann-Whitney test, comparing distances to
the centroids without outliers: P = 0.003 for IDC vs. IDCm, P = 0.04 for RMT vs. IDCm
and P = 0.4 for RMT vs. IDC. (The P-values are approximate because dependence
among distances is introduced through the calculation of the common centroid.)

Based on initial PCA of the 58 samples (RMT, N=21; IDCm, N=25;
IDC, N=12), four outliers were detected: specimens whose FT-IR spectra departed
strikingly from the rest of the group and which had outlier PC scores. The PCA was
repeated initially eliminating these four outliers. The PC scores were then calculated
for these outliers in a manner similar to the others (subtracting the grand mean spectrum
of the 54 samples and then projecting each of the residual spectra on the PC
eigenvectors). It was found that 91% of the variation in absorbance of the 54 samples
was explained by the first five components. This implies that variation among spectra
is highly structured. The 1051 wavenumbers from 1750 to 700 cm⁻¹ constitute
potentially 1051 dimensions of variation. Over 90% of this variation can be represented
by only 5 dimensions.
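The handling of the outliers described above (fitting the PCA without them, then projecting their residual spectra on the PC eigenvectors) might be sketched as follows. The function names fit_pca and project, the simulated spectra and the reduced number of wavenumbers (200) are assumptions made for illustration.

```python
import numpy as np

def fit_pca(spectra, n_components):
    """Fit PCA on the inlier spectra only; return grand mean and eigenvectors."""
    X = np.asarray(spectra, dtype=float)
    grand_mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - grand_mean, full_matrices=False)
    return grand_mean, vt[:n_components]

def project(spectrum, grand_mean, eigenvectors):
    """PC scores for a spectrum that was not used to fit the PCA."""
    residual = np.asarray(spectrum, dtype=float) - grand_mean  # subtract grand mean
    return eigenvectors @ residual  # project the residual on each eigenvector

# Made-up data: 54 inlier spectra and one outlier with much larger variation.
rng = np.random.default_rng(2)
inliers = 1.0 + 0.05 * rng.normal(size=(54, 200))
outlier = 1.0 + 0.5 * rng.normal(size=200)

grand_mean, vecs = fit_pca(inliers, n_components=5)
outlier_scores = project(outlier, grand_mean, vecs)
inlier_scores = (inliers - grand_mean) @ vecs.T
```

Fitting without the outliers keeps a few extreme spectra from dominating the components, while the projection still places them on the same PC axes as the remaining samples.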

There were only weak correlations of PC scores with age, but some 
correlations were statistically significant for all samples combined. Correlations 
between age and PC scores were as follows: r = 0.21 for component and age (P = 0.1), 
30 r = 0.29 for component 2 (P = 0.003), r = 0.03 for component 3 (P.= 0.8), r .= 0.25 for 
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component 4 (P = 0.06) and r = 0:30 for component 5 (P = 0.02). The small magnitude 
of these correlations suggests very little influence of age on spectral structure. Further, 
even the statistically significant correlations (PC-2 and PC-5) appear to be an artifact 
because correlations between the PC scores and age in the cancer and non-cancer 
5 groups separately are very weak — less the 0,18 in magnitude — and are non-significant 
(minimum P = 0.4). There is a broad range of ages for all groups which should allow a 
substantial true correlation to be detected: 17 to 89 for all samples, 26 to 89 for cancer 
(IDC„ and IDC) and 17 to 63 for RMT. There was also no statistically significant 
correlation of the PC scores with the number or percent of positive lymph nodes. 
Figure 5A depicts the overlaid spectra of the two "outliers" ("A" and "B"
in Figure 4) that lie close together on the three-dimensional PCA plot shown in
Figure 5B. The actual spectra differ by only a mean of 3% in normalized absorbance,
indicating high precision in characterizing spectral phenotypes. The two IDC outliers
mentioned earlier are also distinct in spectral profile from the core IDC cluster.
Figures 6A and 6B show these two spectra superimposed on the mean normalized
spectrum of the IDC core cluster. Differences are notable over most of the spectral
area, but especially in the following regions: 1700 to 1350 cm⁻¹, the peak at about 1240
cm⁻¹, and about 1180 to 900 cm⁻¹. These regions generally represent N-H and C-O
vibrations of the bases, PO2 antisymmetric stretching vibrations of phosphodiester
groups, and C-O vibrations of deoxyribose, respectively.

It was described above that the centroid for a related data set (e.g., IDC
specimens) could be calculated, wherein the centroid would be considered the weighted
mean for the spectra associated with a particular species of specimen. Such an activity
is shown in Figure 7 for PC1 and PC2 values for the three types of specimens subject to
analysis. In this figure, the vector from the centroid for RMT specimens to the centroid
for the IDC specimens is shown on the left hand side of the graph and represents the
shift of spectral profiles from an RMT to an IDC state. This direction constitutes an
initial direction and establishes a reference for comparison to the vector derived from
the IDC centroid to the IDCm centroid. The degree of vector rotation, relative to the
RMT-to-IDC vector, is shown in Table 3.
Table 3

Spectral Region     Change in      95% Confidence          P Value
(cm⁻¹)              Direction      Interval for Change

1750-700            94°            66-129°                 0.001
1750-1550           86°            52-127°                 <0.001
1549-1300           127°           93-154°                 <0.001
1299-1200           113°           77-164°                 <0.001
1199-850            108°           65-146°                 <0.001
849-700             83°            28-148°                 <0.001



It therefore can be seen that the effect on DNA from the IDC state to the
IDCm state is not only widespread over the analyzed spectrum, but relatively consistent.
Moreover, the implication of this directional change lends support to the proposition
that as attacks continue on DNA, there is a definite, quantifiable, and predictable
movement of the DNA spectral profile from one state to another.
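
The "change in direction" reported in Table 3 is an angle between two
centroid-to-centroid vectors. A minimal sketch of that calculation is given below,
assuming unweighted centroids and two-dimensional PC scores; the point coordinates
are hypothetical, and the confidence intervals and P values of Table 3 are not
reproduced here because the specification does not state how they were obtained.

```python
import math

def centroid(points):
    """Unweighted mean position of a group of PC-score points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def angle_between(u, v):
    """Angle in degrees between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))

def change_in_direction(first_group, second_group, third_group):
    """Rotation of the second centroid shift relative to the first."""
    c1, c2, c3 = centroid(first_group), centroid(second_group), centroid(third_group)
    first_shift = [b - a for a, b in zip(c1, c2)]    # e.g. RMT to IDC
    second_shift = [b - a for a, b in zip(c2, c3)]   # e.g. IDC to IDCm
    return angle_between(first_shift, second_shift)

# Hypothetical (PC1, PC2) scores for three groups.
rmt = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
idc = [(4.0, 1.0), (6.0, 1.0)]
idcm = [(5.0, 4.0), (5.0, 6.0)]

rotation = change_in_direction(rmt, idc, idcm)
```

An angle near 0° would mean the second shift continues along the first; the values
near or above 90° in Table 3 indicate that the IDC-to-IDCm shift proceeds in a
markedly different direction from the RMT-to-IDC shift.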

Figures 8 and 9 are presented to emphasize the spectral distinctiveness
of the three species of specimens. In Fig. 8, the spectrum for the centroid of each
species is shown. After the grand mean has been subtracted from these curves, the
mean deviations for each species make the distinguishing spectral features of the
species readily discernible, as is best shown in Fig. 9.

In Figure 10, a generally sigmoid curve is established using data sets
generated by FT-IR. The transition from non-cancer to cancer is sharp, indicating that
the manifestation of cancer can ultimately be initiated by a relatively small incentive,
depending upon the "location" of the sample on the curve.

D. Alternative Means for Tissue Acquisition and Long-Term
Storage: As an alternative means to the above described method for obtaining and
preserving specimens for FT-IR analysis, it may be desirable to embed the specimen in
a paraffin block after acquisition and initial preparation. When analysis of the specimen
is desired, the paraffin-embedded tissue (PET) is dewaxed and the DNA is isolated by
using conventional techniques such as application of phenol and/or chloroform
solutions. After determining the purity of the specimen, the DNA is placed in an
aqueous solution, dried under vacuum, and applied to the barium fluoride window for
analysis by FT-IR.

The use of PET specimens for spectral analysis greatly increases the
number of samples available for DNA analysis, since it is not necessary to wait and
obtain special biopsies for analysis (specimens could be easily stored and retrieved at a
later time), and permits retrospective follow-up studies of the same tissue specimens to
be conducted rapidly and economically.

Liver Cancer

A. Material and Methods: English sole were obtained from a
relatively clean rural environment [Quartermaster Harbor, WA] and a chemically
contaminated urban environment [Duwamish River, Seattle, WA]. Their livers were
examined histologically and found to be cancer-free, although they contained various
non-neoplastic lesions characteristic of fish from contaminated environments.

The Duwamish River flows into Puget Sound through a heavily
industrialized area. The sediments contain a variety of carcinogens and other
xenobiotics, such as polynuclear aromatic hydrocarbons and chlorinated pesticide
residues; however, a restoration program is in progress to reduce the sediment
contamination.

Two groups of sole were obtained from the Duwamish River (DUW93,
n = 8; and DUW95, n = 10). Because of the restoration program, the DUW95 samples
were expected to reflect significantly less sediment contamination than the DUW93
samples, but greater than the QMH samples. Fish from Quartermaster Harbor, WA,
served as controls (QMH, n = 7). The lengths ± SD of the QMH, DUW95 and DUW93
fish were 29.5 ± 4.2 cm, 23.6 ± 1.6 cm and 24.1 ± 0.8 cm, respectively. The weights
were 254.3 ± 115.0 g, 125.6 ± 16.2 g and 125.0 ± 22.5 g.

Isolation of DNA from hepatic tissue and PCA analyses of FT-IR spectra
were undertaken as described above. Each FT-IR spectrum was normalized over the
range 1750 to 700 cm⁻¹. PCA was used to identify a few variables (components) that
capture most of the information in the original, long list of variables (the spectral
absorbances at each of the 1051 wavenumbers from 1750 to 700 cm⁻¹); PC scores were
calculated with the grand mean of all spectra subtracted from each spectrum. Thus, the
PC scores represent variations in spectral (structural) features as they differ from the
grand mean spectrum. The Kruskal-Wallis (KW) test and the Mann-Whitney (MW) test
were used to calculate the statistical significance of differences in PC scores between
groups. The same procedures were used to test for differences in spectral diversity,
which was defined for a group as the mean distance of spectra to the group centroid.
The unequal variance t-test was used to compare the mean normalized absorbance
between groups. The t-test was carried out at each of the 1051 wavenumbers from 1750
to 700 cm⁻¹. Fish age, reflected in length and mass, was a potentially confounding
variable and this possibility was addressed in the analysis.
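
The per-wavenumber comparison described above can be sketched with the unequal
variance (Welch) t statistic and its Welch-Satterthwaite degrees of freedom. The
function names and the two-wavenumber toy spectra below are illustrative assumptions;
converting each statistic to a P value additionally requires the t distribution, for
which a library routine such as SciPy's `ttest_ind` with `equal_var=False` could be
used.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Unequal-variance t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def welch_by_wavenumber(group_a, group_b):
    """Apply the test column by column, one result per wavenumber."""
    columns_a, columns_b = zip(*group_a), zip(*group_b)
    return [welch_t(col_a, col_b) for col_a, col_b in zip(columns_a, columns_b)]

# Hypothetical normalized absorbances: rows are spectra, columns are wavenumbers.
group_a = [[1.0, 5.0], [2.0, 5.1], [3.0, 4.9]]
group_b = [[4.0, 5.0], [5.0, 5.1], [6.0, 4.9]]

results = welch_by_wavenumber(group_a, group_b)
```

In this toy example the first wavenumber differs strongly between groups while the
second does not differ at all, so only the first would yield a small P value.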

B. Results: Figure 11 shows a PCA plot of the first three PC scores
for specimens obtained from a location known not to be polluted (blue spheres);
specimens obtained from an area known to be polluted (yellow spheres); and specimens
obtained from the same polluted area prior to significant clean-up and/or environmental
actions to remove polluted sediment (maroon spheres). As can be seen through
inspection of the figure, a distribution similar to that encountered with breast tissue is
present in the DNA of fish liver.

The clusters of points derived from the first three PC scores, which
summarize spectral features of the DNA from the QMH and DUW groups, are shown in
a three-dimensional projection (Figure 11). The hypothesis that all groups have the
same mean values of PC scores (thus, similar spectra) is rejected (KW P-value <0.001)
and the hypothesis that any two of the groups have the same mean values of PC scores
is also rejected (MW P-value 0.04 to <0.001). The three groups are distinct without any
overlap (Figure 11). PC1 and PC2, combined, account for 94% of the spectral variation
and thus provide a good means for representing the variety of spectra encountered. PC3
is used for display purposes (Figure 11), although it explains only 3% of the spectral
variation.
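
The rank-based statistics behind the KW and MW P-values quoted above can be sketched
as follows. This is an illustration only: the PC scores are hypothetical, the
Kruskal-Wallis statistic omits the tie correction, and P-values are not computed
(library routines such as SciPy's `mannwhitneyu` and `kruskal` provide them).

```python
def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Smaller of the two Mann-Whitney U statistics (0 means no overlap)."""
    ranks = midranks(list(a) + list(b))
    u_a = sum(ranks[:len(a)]) - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical PC1 scores for a reference group and an exposed group.
pc1_reference = [-6.0, -5.5, -7.0]
pc1_exposed = [-12.0, -13.5, -11.0, -14.0]

u_separated = mann_whitney_u(pc1_reference, pc1_exposed)
u_overlapping = mann_whitney_u([1, 3], [2, 4])
h_three_groups = kruskal_wallis_h([1, 2], [3, 4], [5, 6])
```

A U statistic of zero corresponds to complete separation of two groups on that
component, which is the situation the text describes as clusters "without any
overlap".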
The differences between groups occur at many frequencies. The upper
part of each panel in Figure 12 shows the mean spectrum for each of two groups (QMH
vs. DUW93; QMH vs. DUW95; and DUW95 vs. DUW93). The bottom part of the panel
shows P-values for each spectral comparison, one P-value per wavenumber. The
comparisons yield P < 0.05 at 78 to 87% of the 1051 wavenumbers, thus demonstrating
that the structures of the DNAs from the DUW93 and DUW95 groups are markedly
different from each other and from the QMH group. Accordingly, the findings substantially
invalidate the null hypothesis that the mean normalized spectra are equal between
groups. The spectral differences are notable with respect to the antisymmetric
stretching vibrations of the PO2 structure (≈ 1240 cm⁻¹). The band at this spectral
region is present in the QMH group, but is virtually lost in the spectra of the DUW93
and DUW95 groups. Other major differences are evident in spectral regions
representing vibrations associated with the nucleic acids (≈ 1700 to 1450 cm⁻¹) and
deoxyribose (≈ 1150 to 950 cm⁻¹).

It is obvious (Figure 11) that the samples can be 100% correctly 
classified into groups (separated) on the basis of the PC scores (Table 4). 

Table 4

Principal component scores by group and statistical significance
of differences between groups

                                                           KW P-value    MW P-value   MW P-value   MW P-value
Principal    QMH (n = 7)   DUW95 (n = 10)   DUW93 (n = 8)  for overall   for QMH      for QMH      for DUW93
component    Mean ± SD     Mean ± SD        Mean ± SD      differences   vs. DUW93    vs. DUW95    vs. DUW95

PC1          -6.1 ± 1.4    -12.8 ± 2.8      21.3 ± 12.3    <0.001        0.001        <0.001       <0.001
PC2          6.1 ± 1.3     -3.3 ± 2.6       -1.3 ± 1.4     <0.001        <0.001       <0.001       0.04
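
As an illustration of how group membership could be scored from PC values, the sketch
below fits a normal density to each group using the PC1 and PC2 means and SDs reported
in Table 4 and converts the densities into relative probabilities. Two simplifying
assumptions are made that the specification does not state: PC1 and PC2 are treated as
uncorrelated within each group, and the groups are given equal prior weight. It is a
sketch of the general approach, not the model actually fitted.

```python
import math

# Means and SDs of (PC1, PC2) as reported in Table 4.
GROUPS = {
    "QMH":   {"mean": (-6.1, 6.1),   "sd": (1.4, 1.3)},
    "DUW95": {"mean": (-12.8, -3.3), "sd": (2.8, 2.6)},
    "DUW93": {"mean": (21.3, -1.3),  "sd": (12.3, 1.4)},
}

def log_density(point, mean, sd):
    """Log of a normal density with independent components (assumption)."""
    return sum(-0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
               for x, m, s in zip(point, mean, sd))

def group_probabilities(point, groups=GROUPS):
    """Relative probability of each group for one (PC1, PC2) point, equal priors."""
    logs = {name: log_density(point, g["mean"], g["sd"]) for name, g in groups.items()}
    top = max(logs.values())                      # subtract max for stability
    weights = {name: math.exp(v - top) for name, v in logs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

at_qmh_mean = group_probabilities((-6.1, 6.1))
at_duw95_mean = group_probabilities((-12.8, -3.3))
at_duw93_mean = group_probabilities((21.3, -1.3))
```

A new sample would be scored by passing its own (PC1, PC2) pair to
`group_probabilities`; the group with the largest value is the most likely assignment
under these assumptions.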
Figure 11 shows that the diversity of spectra (note the spread of points)
is substantially greater in the DUW93 and DUW95 groups, compared to the QMH
group. The varying diversity between the groups and the spectral differences which
separate them are also evident in Figure 13, in which the individual spectra of each
group are overlaid. The tightness of the QMH spectra and the increasing spectral
diversity from the QMH to the Duwamish River groups are notable in the region ≈ 1700
to 1450 cm⁻¹, which includes strong C-O stretching and NH2 bending vibrations of the
nucleic acids. Also in the DUW93 group, compared to the other groups, there is a
pronounced increase in absorbance and spectral diversity in the 1400 cm⁻¹ region
assigned to weak NH vibrations and CH in-plane deformations of the nucleic acids.
The region ≈ 1150 to 950 cm⁻¹, which includes strong stretching vibrations associated
with deoxyribose, increases in spectral diversity from QMH to DUW95, but tightens in
the DUW93 group. The differences between the spectral properties are consistent with
the discrimination between groups shown in Table 4 and the increased diversity of the
clusters illustrated in Figure 11.

A formal test for diversity differences (KW test for the null hypothesis
that all groups have the same mean distance to the group centroid) yields P = 0.002,
strongly suggesting unequal diversity among groups. These mean distances to the
centroid provide a scale for measuring diversity: a larger mean distance indicates that a
group is more spread out (Figure 11); that is, the spectra are more diverse. The DUW93
group has a mean distance which is four times that of the QMH group, representing a
four-fold greater diversity (Table 5). Two of the three pairwise comparisons of
diversity are significant (P <0.05); however, the comparison between the DUW95 and
DUW93 groups is not significant (MW P-value = 0.2), although the DUW93 group
(representing DNA with the most altered base structure) is more diverse than the
DUW95 group.
Table 5

Spectral diversity for three groups

Group      Distance to group centroid      N
           (diversity), Mean ± SD

QMH        2.5 ± 1.0                       7
DUW95      5.8 ± 2.0                       10
DUW93      10.2 ± 7.2                      8

P-values for null hypotheses: (1) all three groups have
the same mean diversity, KW P-value = 0.002; (2) mean
QMH = mean DUW95, MW P-value = 0.003; (3) mean
DUW95 = mean DUW93, MW P-value = 0.2
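
The diversity measure tabulated here, the mean distance of a group's points to the
group centroid, is straightforward to compute. The sketch below uses Euclidean
distance and invented two-dimensional points; the choice of Euclidean distance and
the example coordinates are assumptions made for illustration.

```python
import math

def centroid(points):
    """Unweighted mean position of a group of points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def diversity(points):
    """Mean Euclidean distance of the points to their own centroid."""
    center = centroid(points)
    return sum(math.dist(p, center) for p in points) / len(points)

# Hypothetical PC-score clusters: the second is the first stretched four-fold.
tight_cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
spread_cluster = [(0.0, 0.0), (8.0, 0.0), (0.0, 8.0), (8.0, 8.0)]

ratio = diversity(spread_cluster) / diversity(tight_cluster)
```

For comparison, the tabulated means give a ratio of 10.2 / 2.5, or about 4.1, for
DUW93 relative to QMH and 5.8 / 2.5, or about 2.3, for DUW95 relative to QMH.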

The varying diversity of the groups is unlikely to be due to age variables.
The QMH group is the most diverse in length and mass, yet it shows the least spectral
diversity. The QMH group shows a length SD that is two to five times larger than that
of the DUW95 and DUW93 groups and a mass SD that is five to seven times larger.
However, the mean distance of the QMH spectra to their centroid is two to four-fold
smaller than that of the Duwamish groups. These results would be highly inconsistent
if age were a significant factor in spectral diversity. Length and mass also appear to
have little effect in creating the spectral differences by location (Figure 11). In
regression analysis, length and mass combined explained only 7% of the variation in
PC1 and 40% of the variation in PC2. PC1 is by far the more important component in
explaining spectral diversity. Length and mass explain only about 9% of the overall
spectral variation, whereas location explains 77%.
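
One common way to express the share of variation attributable to a grouping factor
such as location is the ratio of the between-group sum of squares to the total sum of
squares, as in one-way analysis of variance. The sketch below shows that calculation
on invented points. It is offered as an assumption about how such a percentage can be
obtained; the specification does not state how the 77% figure was derived.

```python
def centroid(points):
    """Unweighted mean position of a set of points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fraction_explained_by_group(groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_points = [p for points in groups.values() for p in points]
    grand = centroid(all_points)
    total_ss = sum(squared_distance(p, grand) for p in all_points)
    between_ss = sum(len(points) * squared_distance(centroid(points), grand)
                     for points in groups.values())
    return between_ss / total_ss

# Hypothetical (PC1, PC2) scores for two locations.
scores_by_location = {
    "reference": [(0.0, 0.0), (0.0, 2.0)],
    "exposed": [(10.0, 0.0), (10.0, 2.0)],
}

explained = fraction_explained_by_group(scores_by_location)
```

A value near 1 means that most of the spread in PC scores lies between the groups
rather than within them.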

The DNA structures isolated from the QMH, DUW95 and DUW93 fish
were each unique in that the PC plot revealed a complete separation of clusters
(Figure 11). In addition, the DNAs from the exposed groups were substantially more
diverse than those of the control group and the DUW93 group was more diverse than
the DUW95 group (Table 5, Figure 11). These distinctions, which were not
significantly age-related, likely arose from structural features induced in DNA by
different environmental factors. Among the environmental factors likely contributing to
the cluster separations and the differences in diversity are the type, degree and duration
of exposure to toxic chemicals in the sediments. Striking differences occurred between
the three groups in regions of the spectra assigned to the nucleic acids and the
phosphodiester-deoxyribose structure (Figures 12 and 13), suggesting that alterations in
these structures contributed substantially to the separation of clusters and the
differences in diversity among groups.

There was a statistically significant increase in the diversity of clusters
representing the two Duwamish River groups, compared to the tight cluster of the
reference group (Figure 11; Table 5). Increased diversity may be especially important
in carcinogenesis in that it sets the stage for the selection of DNA forms that give rise to
malignant cellular phenotypes. The high degree of diversity in the exposed fish groups
may serve the same function.

Cluster separation in PC plots was described above in studies of prostate
(Example 1) and breast (Example 2) cancer. With the prostate, for example, perfect
discrimination was achieved between DNA from normal and adenocarcinoma tissue.
Similarly, perfect discrimination was obtained between clusters in this Example, thus
demonstrating that the DNA structures had unique properties representing new forms of
DNA. Considering that fish in the Duwamish River are prone to liver tumors, the
distinctly different forms of DNA found in the DUW95 and DUW93 groups likely
constitute critical stages in the progression to cancer.

This Example has shown that damage to the DNA of English sole
exposed to environmental chemicals leads to new, diverse forms of DNA. These new
forms may play a pivotal role in carcinogenesis and ultimately contribute to the
development of liver cancer in the fish population. In addition, the results raise the
question whether environmental chemicals play a role in generating the new forms of
DNA found in breast and prostate cancers as described above.
All publications and patent applications mentioned in this specification
are herein incorporated by reference to the same extent as if each individual publication 
or patent application was specifically and individually incorporated by reference. 

From the foregoing, it will be evident that, although specific 
embodiments of the invention have been described herein for purposes of illustration, 
various modifications may be made without deviating from the spirit and scope of the 
invention. 
CLAIMS

1. A method for defining the state of tissue comprising the steps:

(a) subjecting DNA from a first plurality of tissue samples to Fourier
transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

(b) analyzing the FT-IR spectral data of step (a) by principal components
analysis (PCA) to provide principal component (PC) scores;

(c) applying cluster analysis to the PC scores of step (b) to distinguish
outlier and non-outlier tissue samples; and

(d) generating an equation, called a first equation, that defines a
multivariate version of a normal bell-shaped curve which best fits the PC values from the
non-outlier tissue samples, where the first equation defines the state of the first plurality of
tissue samples.

2. A method according to claim 1, further comprising repeating steps (a)
through (d) with a second plurality of tissue samples, to provide a second equation, where the
second equation defines the state of the second plurality of tissue samples.

3. A method according to claim 2, further comprising applying
multivariate discrimination analysis to the first and second equations, to provide first and
second probability equations, respectively.

4. A method according to claim 3, further comprising the steps:

(e) subjecting a DNA sample from a tissue having a state of interest to
FT-IR spectroscopy to produce FT-IR spectral data;

(f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of
PC scores; and

(g) combining the PC scores of step (f) with each of the first and second
probability equations to provide first and second probability scores, respectively.
5. A method according to any of claims 1 wherein the tissue is breast,
urogenital, liver, renal, pancreatic, lung, blood, brain or colorectal tissue.

6. A method according to claim 1 wherein the tissue is cancerous tissue.

7. A method according to claim 6 wherein the tissue is cancerous breast,
prostate, ovarian or endometrial tissue.

8. A method for assessing the genotoxicity of an environment comprising
the steps of:

(a) subjecting DNA from a plurality of first organisms in a first
environment to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral
data;

(b) analyzing the FT-IR spectral data of step (a) by principal components
analysis (PCA) to provide principal component (PC) scores;

(c) applying cluster analysis to the PC scores of step (b) to distinguish
outlier and non-outlier organisms; and

(d) generating an equation, called a first equation, that defines a
multivariate version of a normal bell-shaped curve which best fits the PC values from the
non-outlier organisms, where the first equation defines the first organisms in the first
environment.

9. A method according to claim 8, further comprising repeating steps (a)
through (d) with second organisms from a second environment, to provide a second equation,
where the second equation defines the state of the second organisms in the second
environment.

10. A method according to claim 9, further comprising applying
multivariate discrimination analysis to the first and second equations, to provide first and
second probability equations, respectively.
11. A method according to claim 10, further comprising the steps:

(e) subjecting a DNA sample of an organism of interest from an
environment of interest to FT-IR spectroscopy to produce FT-IR spectral data;

(f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of
PC scores; and

(g) combining the PC scores of step (f) with each of the first and second
probability equations to provide first and second probability scores, respectively.

12. A method according to claim 9 wherein at least one of the first and
second environments is a polluted environment.

13. A method according to claim 9 wherein the first and second organisms
are non-identical, however the first and second environments are identical.

14. A method according to claim 9 wherein the first and second organisms
are identical, however the first and second environments are non-identical.